Nutch 2.1 (HBase, SOLR) with Amazon Web Services - solr

I experienced Nutch 2.1 locally without any difficulty. I have also tried on a 3 machine distributed cluster. We're now discussing whether to run it with Amazon Web Services or not. I do not have much experience with AWS. My question is that, is it possible and neccessary to try Nutch2.1 crawling and indexing parts on the cloud. What possible advantages and disadvantages we will have?
Thanks.

If you have a cluster with same capacity as that of a AWS cluster (that you plan to invest in) then there is no advantage except for #1 below.
Here are several factors that you should think about before switching to AWS:
Locality of hosts crawled: If you are sitting in Europe and the websites that you want to crawl are hosted far away ... say Australia. If you buy AWS nodes located in Australia, it would be much faster for crawling that data rather than crawling from Europe.
Cost: For using AWS machines, you need to pay then on hourly basis. Can you afford that ? If not better use your own machines
Current cluster capacity : does your current cluster has ample capacity and space to handle the amount of crawled data ? I think there wont be problem in terms of computational speed as Nutch runs on Hadoop which was designed to run on commodity hardware. Can your cluster accommodate entire data that is being fetched by the crawler.
Volume of data : What is a rough estimate of the data that is being crawled ? If its less, then it makes no sense to have an AWS cluster.
Time constraints : Is there any time bound for completion for the crawl ?
If you are doing this for a professional project, then these factors must be given a thought.
If you are doing it for fun/hobby/learning, go ahead and use free tier nodes of AWS. Those are low capacity nodes given free by Amazon. Its fun to learn new things :)
Advantages of AWS:
No need to buy machines for setting up a cluster. get started without having any hardware except a terminal PC.
Locality
No need to look after machines. If a node crashes badly, leave it (its not your problem :P). Buy a new one, add it to the cluster and go ahead.
Disadvantages of AWS:
Costly.
Copying data to any machine outside AWS cluster is charged.
Your data is NOT persisted when u give up the procured AWS nodes. If u want to persist it, pay them and use the S3 storage service.

Related

What is the smallest AWS EC2 instance I can run a postgres db on?

There is the free tier on AWS and I can get a micro EC2 instance for free essentially.. or close to.. I'm sure setting up elastic ips - loads balancers etc is extra.
Would it effectively be possible for me to run a postgres DB - for a small api. Roughly about 50 inserts + 50 reads per second ... say about 6000 operations per min at the most.
I can't seem to find anything online - which makes me think that this might be a silly idea.
For this not to be an "open question" - it's simply: Is it possible and realistic to expect usable performance on an EC2 instance running my postgres DB.
The best way to determine whether the database can handle a particular workload is to test it at that capacity. Launch the database, simulate traffic and monitor its performance. Please note that every application uses a database differently, so nobody can provide "general advice" as to whether a particular-sized database would meet the needs of your particular application.
If you are going to run 'production' workloads, try to avoid using the Burstable performance instances (T2, T3) since they can hit limits under heavy workloads unless the 'Unlimited' option is selected. T2/T3 is great for bursty workloads, but not for sustained workloads.
Comparing m5.xlarge between EC2 and RDS:
Amazon EC2: 19.2c/hr ($4.61/day)
Amazon RDS: 35.6c/hr ($8.54/day)
For the additional price, Amazon RDS provides a fully-managed database, automated backups, CloudWatch metrics, etc. This is probably worth much more than $4 of your time every day.
Alternatively, if you can modify your application to use NoSQL instead of SQL, you could use Amazon DynamoDB where the capacity you mention would cost 4c/hour ($1/day) plus request and storage costs.
Don't underspend on your database — it powers everything you do. Instead, try to save money by turning off non-production systems when they aren't being used (eg weekends and evenings). That will hopefully give you enough savings to afford an appropriately-powered database.

Web scraping vs Cloud Storage with AWS

My team has run into a design conflict. We are working on a project that involves scraping historical data from yahoo for all stocks for the last year to run some ML analysis on it. The latency is unbearably slow, not sure if it's the network or the web scraper. I proposed we use AWS RDS to store the data so we can access it quicker. However, a team member said that storing the data in the cloud would not solve our latency issue. I rebutted with the fact that the data will be organized and stored in a way to access the data significantly faster. He came back with something else and this went on. Is it true that a cloud DB won't offer any additional speed compared to a scraper? If so does AWS have a service that allows us to access the data we store faster through another service, almost as if the database was on our own server?
I am not that all familiar with cloud services but I do understand databases pretty well. So please dumb down the AWS stuff if you wish and feel free to point me to any duplicates or links that may help me understand this more.
Lots of good reasons to use RDS as a database, but speeding up your scraping isn't one of them - it likely isn't your bottleneck.
I have written lots of scrapers over the years, and by far the biggest performance boost will be to have a fast network connection between the scraper machine(s) and the host you are scraping, and even then, using a multi-threaded scraper for each scraping machine will give you another HUGE speed improvement.
Most time spent scraping is waiting on the host to return the results to you, not parsing the page and not saving the database to a database.
A MySQL DB on AWS RDS would be the same as the one that you'd install yourself on some machine. So, it isn't going to be different or slower just because it is in the cloud.
If you scrape some data and process it only once, then there is no point in introducing a DB in between. But if your scraper is slow and you process scraped data multiple times, then storing it in a DB should improve latencies. That is because the latencies of a DB read will be much lesser than that of scraping (assuming you design your DB schema properly; your hosts are in the same availability zones, or at least regions, as your DB etc.).
For e.g., if scraping a webpage takes ~10s and you process the scraped data twice, it'd take you ~20s if you don't have a DB. If you have a DB which has latencies of ~500ms you'd only take ~11s.

estimate server cost using aws

pardon me if this isn't the place to ask such question. But I have finished my project and thinking to deploy it using amazon elastic beanstalk and got a huge worry. My worry is that my project's database can be humongous. It's a community website like a reddit that users can create a page that other people can post text,link,pic,video(youtube). Also users get a profile page, and are able to comment as well. This was my first big project, and I don't want to pay more than $200 for server fee every month.
should I still deploy this? or just be happy I proved myself I can make this? how much do you think I'll have to pay assuming I get about (max)100 users?
For starters, you can look at the costs for any AWS service by going to that services 'homepage' and clicking "Pricing" usually on the left side. I typically get to the pricing page by Googling "AWS <> Pricing" (e.g. "AWS EC2 Pricing").
Whether or not you incur any cost and what that cost is, really depends on how you deploy your website. Questions like, is your database self-managed (i.e. installed on your own EC2 instance) or are you using RDS? Are you using S3 to store static content? Will you be serving your web contents via Cloudfront (AWS' CDN)?
Many of the basic services (EC2, S3, RDS, etc.) have free-tiers which will allow you to use them for free, provided you stay within certain (and usually very low) levels of usage.
If your database is going to get VERY large, and cost is your primary concern, it's usually more cost-effective to manage it on your own EC2 server, however then things like updates, security, scaling, backups, etc. all become your problem to deal with and often can incur additional cost (i.e. your backups will likely require volume snapshots which will cost you vs. RDS' backups are free).
If you're going to have a significant amount of static content, it will be more cost effective to host it on your own EC2 server, however again, all maintenance will be your responsibility, such as backups and scaling (which can incur cost) to meet demand vs. all of that is taken care of by S3, though you pay each time a file is accessed.
If cost is your primary concern, my suggestion is to start out your development using the AWS services (RDS, S3 maybe Elastic Beanstalk), though that can often complexity to your development efforts (dealing with authentication, additional SDK's, etc.). You can typically and pretty easily roll out your own service (MySQL, EBS filesystem to replace S3, etc.) as replacement. Additionally, again depending on your roll-out, there can be network traffic costs. Usually, this isn't a problem if you're doing things the way Amazon wants you to, but...it wouldn't be unheard of.
To get you started:
https://aws.amazon.com/s3/pricing/
https://aws.amazon.com/ec2/pricing/
https://aws.amazon.com/rds/pricing/
Additionally, there is a nifty calculator which can help you estimate your costs. You will need to know what your traffic expectations as well as service requirements are, but you can play around with those numbers.
https://calculator.s3.amazonaws.com/index.html
You don't have to worry about charges. As AWS has a Free tier which offers most of the services free for 12 months.
https://aws.amazon.com/free/
You can have 01 t2.micro (1 vCpu + 1 gb RAM) instance for Elastic beanstalk (EBS) with auto scaling turned off, purchase Reserved instances by making 1 or 3 year all upfront payment to save more.
https://aws.amazon.com/ec2/purchasing-options/reserved-instances/
You must use Relational Database Service (RDS) for database, instead of installing db in your EBS instance & store files on Simple Storage Service (S3).
RDS Pricing is $0.100 per GB-month, after first 12 months. So you dont need to worry about the database size.
After first 12 months, your monthly bill would be less than USD 50.
On production, we are using 2 t2.micro instances (Windows) with 1 Ms Sql database on RDS & we only have to pay for the extra EC2 machine, about USD 14 per month.
I did find some relevant information in the below stackoverflow thread which talks about the capabilities of t2.micro,t2.small & t2.medium ec2 instances.Do have a look at it.
several t2.micro better than a single t2.small or t2.medium

MapReduce in the cloud

Except for Amazon MapReduce, what other options do I have to process a large amount of data?
Microsoft also has Hadoop/MapReduce running on Windows Azure but it is under limited CTP, however you can provide your information and request for CTP access at link below:
https://www.hadooponazure.com/
The Developer Preview for the Apache Hadoop- based Services for Windows Azure is available by invitation.
Besides that you can also try Google BigQuery in which you will have to move your data to Google propitiatory Storage first and then run BigQuery on it. Remember BigQuery is based on Dremel which is similar to MapReduce however faster due to column based search processing.
There is another option is to use Mortar Data, as they have used python and pig, intelligently to write jobs easily and visualize the results. I found it very interesting, please have a look:
http://mortardata.com/#!/how_it_works
DataStax Brisk is good.
Full-on distributions
Apache Hadoop
Cloudera’s Distribution including Apache Hadoop (that’s the official name)
IBM Distribution of Apache Hadoop
DataStax Brisk
Amazon Elastic MapReduce
HDFS alternatives
Mapr
Appistry CloudIQ Storage Hadoop Edition
IBM Global Parallel File System (GPFS)
CloudStore
Hadoop MapReduce alternatives
Pervasive DataRush
Cascading
Hive (an Apache subproject, included in Cloudera’s distribution)
Pig (a Yahoo-developed language, included in Cloudera’s distribution)
Refer : http://gigaom.com/cloud/as-big-data-takes-off-the-hadoop-wars-begin/
If want to process large amount of data in real-time ( twitter feed, click stream from website) etc using cluster of machines then check out "storm" which was opensource'd from twitter recently
Standard Apache Hadoop is good for processing in batch with petabytes of data where latency is not a problem.
Brisk from DataStax as mentioned above is quite unique in that you can use MapReduce Parallel processing on live data.
There are other efforts like Hadoop Online which allows to process using pipeline.
Google BigQuery obviously another option where you have csv (delimited records) and you can slice and dice without any setting up. It's extremely simple to use ,but is a premium service where you have to pay by no. of bytes processed ( first 100GB / month is free though).
If you want to stay in the cloud, you can also spin up EC2 instances to create a permanent Hadoop cluster. Cloudera has plenty of resources about setting up such a cluster here.
However, this option is less cost effective than Amazon Elastic Mapreduce, unless you have lots of jobs to run through the day, keeping your cluster fairly busy.
The other option is to build your own cluster. One of the nice features of Hadoop is that you can cobble heterogenous hardware into a cluster with decent computing power. The kind that can live in a rack in your server room. Considering that older hardware that's laying around is already paid for, the only costs to getting such a cluster going is new drives, and perhaps enough memory sticks to maximize the capacity of those boxes. Then cost effectiveness of such an approach is much better than Amazon. The only caveat would be whether you have the bandwidth necessary for pulling down all the data into the cluster's HDFS on a regular basis.
Google App Engine does MapReduce as well (at least the map part for now). http://code.google.com/p/appengine-mapreduce/

Are there any "gotchas" in deploying a Cassandra cluster to a set of Linode VPS instances?

I am learning about the Apache Cassandra database [sic].
Does anyone have any good/bad experiences with deploying Cassandra to less than dedicated hardware like the offerings of Linode or Slicehost?
I think Cassandra would be a great way to scale a web service easily to meet read/write/request load... just add another Linode running a Cassandra node to the existing cluster. Yes, this implies running the public web service and a Cassandra node on the same VPS (which many can take exception with).
Pros of Linode-like deployment for Cassandra:
Private VLAN; the Cassandra nodes could communicate privately
An API to provision a new Linode (and perhaps configure it with a "StackScript" that installs Cassandra and its dependencies, etc.)
The price is right
Cons:
Each host is a VPS and is not dedicated of course
The RAM/cost ratio is not that great once you decide you want 4GB RAM (cf. dedicated at say SoftLayer)
Only 1 disk where one would prefer 2 disks I suppose (1 for the commit log and another disk for the data files themselves). Probably moot since this is shared hardware anyway.
EDIT: found this which helps a bit: http://wiki.apache.org/cassandra/CassandraHardware
I see that 1GB is the minimum but is this a recommendation? Could I deploy with a Linode 720 for instance (say 500 MB usable to Cassandra)? See http://www.linode.com/
How much ram you needs really depends on your workload: if you are write-mostly you can get away with less, otherwise you will want ram for the read cache.
You do get more ram for you money at my employer, rackspace cloud: http://www.rackspacecloud.com/cloud_hosting_products/servers/pricing. (our machines also have raided disks so people typically see better i/o performance vs EC2. Dunno about linode.)
Since with most VPSes you pay roughly 2x for the next-size instance, i.e., about the same as adding a second small instance, I would recommend going with fewer, larger instances than more, smaller ones, since in small numbers network overhead is not negligible.
I do know someone using Cassandra on 256MB VMs but you're definitely in the minority if you go that small.

Resources