I have a 2 GB database and a front end that will likely handle 10-15 hits per day. Is the AWS free tier (with MySQL on RDS) a good place to start?
Will CakePHP apps encounter timeouts or other resource issues due to the sizing of the Micro instance?
Micro instance (from Amazon's description):

    Micro instances (t1.micro) provide a small amount of consistent CPU resources and allow you to increase CPU capacity in short bursts when additional cycles are available. They are well suited for lower throughput applications and web sites that require additional compute cycles periodically. You can learn more about how you can use Micro instances and appropriate applications in the Amazon EC2 documentation.

    Micro Instance: 613 MiB of memory, up to 2 ECUs (for short periodic bursts), EBS storage only, 32-bit or 64-bit platform
If you're only getting a very small number of hits, you can probably run your application and MySQL database on a micro instance.
The micro instance will be free, but you will have to pay for RDS.
You should not notice any issues - we do most of our testing on micros, and our database is larger than yours.
It will work perfectly for your scenario.
I have deployed applications for other clients with at least twice your requirements, and they worked fine.
If your application saves and retrieves files on disk, I suggest giving Amazon S3 a try.
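As a rough illustration, here is a minimal sketch of storing and fetching a file on S3 with the boto3 Python library (the bucket and key names are placeholders):

    import boto3  # assumes AWS credentials are configured in the environment

    s3 = boto3.client('s3')

    # Upload a local file to a bucket (names are placeholders).
    s3.upload_file('reports/summary.pdf', 'my-app-files', 'reports/summary.pdf')

    # Later, fetch it back to a local path.
    s3.download_file('my-app-files', 'reports/summary.pdf', '/tmp/summary.pdf')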
Currently clouds are mushrooming like crazy and people are starting to deploy everything to the cloud, including CMS systems, but so far I have not seen anyone succeed in deploying a popular CMS system to a load-balanced cluster in the cloud. Some performance hurdles seem to prevent standard open-source CMS systems from being deployed to the cloud like this.
CLOUD: A cloud, or better a load-balanced cluster, has at least one frontend server, one network-connected(!) database server and one cloud-storage server. This fits Amazon Beanstalk and Google App Engine well. (This specifically excludes a CMS on a single computer or a Linux server with MySQL on the same "CPU".)
Deploying a standard CMS in such a load-balanced cluster requires a cloud-ready CMS with the following characteristics:
The CMS must deal with the latency of queries and still be responsive, rendering pages in less than a second so they can be cached (or use a precaching strategy)
The filesystem probably must be connected to remote storage (Amazon S3, Google Cloud Storage, etc.)
Currently I know of Python/Django and WordPress having middleware modules or plugins that can connect to cloud storage instead of a filesystem, but there may be other cloud-ready CMS implementations (Java, PHP, ?) and systems.
I myself have failed to deploy django-CMS to the cloud, ultimately due to the query latency of the remote DB. So here is my question:
Did you deploy an open-source CMS that still performs well in rendering pages and in the backend admin? Please post your average page-rendering stats in milliseconds for uncached pages.
IMPORTANT: Please describe your configuration, the problems you encountered, and which modules had to be optimized in the CMS to make it work. Don't post a simple "this works"; contribute your experience and knowledge.
Such a CMS probably has to make fewer than 10 queries per page; if it makes more, the queries must be made in parallel (as sketched below). It must also cope with filesystem access times of 100 ms for a stat and query delays of 40 ms.
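To make the latency arithmetic concrete: 10 sequential queries at 40 ms each already cost 400 ms before any rendering. A minimal Python sketch of issuing independent queries in parallel instead (the three fetch functions are hypothetical stand-ins that each simulate one 40 ms round trip):

    import time
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical stand-ins for real data-access functions; each simulates
    # one ~40 ms round trip to the remote database.
    def fetch_navigation():
        time.sleep(0.04)
        return ['Home', 'About']

    def fetch_page_body():
        time.sleep(0.04)
        return '<p>...</p>'

    def fetch_sidebar():
        time.sleep(0.04)
        return ['Recent posts']

    # Run the independent queries concurrently: total latency is roughly one
    # round trip (~40 ms) instead of the sum of three (~120 ms).
    with ThreadPoolExecutor(max_workers=3) as pool:
        nav, body, sidebar = pool.map(
            lambda f: f(), [fetch_navigation, fetch_page_body, fetch_sidebar])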
Related:
Slow MySQL Remote Connection
Have you tried Umbraco?
It relies on a database, but it keeps layers of cache, so you aren't doing SELECTs on every request.
It works great on Azure too: http://umbraco.com/azure
I have found an excellent performance test of WordPress on App Engine. It appears that Google has spent some time optimizing this system for load-balanced cluster and remote-DB deployment:
http://www.syseleven.de/blog/4118/google-app-engine-php/
Scaling test from the report:

    parallel hits    GAE     1&1    Sys11
    1                 1.5     2.6     8.5
    10                9.8     8.5    69.4
    100              14.9      -    146.1
Conclusion from the report: the system is slower than on traditional hosting, but it scales much better.
http://developers.google.com/appengine/articles/wordpress
We have managed to deploy the Python django-CMS (www.django-cms.org) on Google App Engine, with CloudSQL as the DB and Cloud Storage as the filesystem. Cloud storage was attached by forking and fixing a django.storage module by Christos Kopanos: http://github.com/locandy/django-google-cloud-storage
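For reference, wiring such a storage backend into Django is a settings change; a minimal sketch (the class path and setting names are placeholders, since the exact ones depend on the fork you use):

    # settings.py -- illustrative only; the storage class path and bucket
    # setting below are placeholders for whatever the forked module exposes.
    DEFAULT_FILE_STORAGE = 'django_google_cloud_storage.GoogleCloudStorage'
    GS_BUCKET_NAME = 'my-cms-media'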
With storage attached, a second set of problems came up when we discovered access times of up to 17 s for a single page. We investigated and found that easy-thumbnails 1.4 accessed the normal filesystem for mod_time requests while writing results to the store (rendering all thumbnail images on every request). We switched to the development version, where that was already fixed.
Then we worked with SmileyChris to fix the unnecessary access of mod_times (stat'ing the file) on every request for every image, by tracing and posting issues to http://github.com/SmileyChris/easy-thumbnails
This reduced access times from 12-17 s to 4-6 s per public page on the CMS, basically eliminating all storage/"file"-system access. Once that was fixed, easy-thumbnails replaced (by design) filesystem accesses with DB queries, checking on every request whether a thumbnail's source image had changed.
One thing for the web designer: if she uses an image.width statement in the template, this forces an ugly, slow read on the "filesystem", because image widths are not cached.
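One way around that particular read, assuming you control the model: Django's ImageField can persist image dimensions into ordinary database columns via width_field/height_field, so templates read cached numbers instead of opening the stored file (the model and field names here are made up):

    from django.db import models

    class Photo(models.Model):
        # width_field/height_field make Django store the dimensions in DB
        # columns at upload time, so reading them later never touches the
        # (remote) file storage.
        image = models.ImageField(upload_to='photos',
                                  width_field='image_width',
                                  height_field='image_height')
        image_width = models.PositiveIntegerField(null=True, editable=False)
        image_height = models.PositiveIntegerField(null=True, editable=False)

In the template, photo.image_width then replaces photo.image.width.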
Further investigation led to the conclusion that DB accesses are very costly too, taking about 40 ms per round trip.
Up to now the deployment is unsuccessful, mostly due to DB access times in the cloud leading to 4-5 s delays in rendering a page before it is cached.
I have run Nutch 2.1 locally without any difficulty. I have also tried it on a 3-machine distributed cluster. We're now discussing whether or not to run it on Amazon Web Services. I do not have much experience with AWS. My question is: is it possible, and necessary, to run Nutch 2.1's crawling and indexing parts in the cloud? What possible advantages and disadvantages would we have?
Thanks.
If you have a cluster with the same capacity as the AWS cluster you plan to invest in, then there is no advantage except for the first point (locality) below.
Here are several factors that you should think about before switching to AWS:
Locality of crawled hosts: if you are sitting in Europe and the websites you want to crawl are hosted far away, say in Australia, then AWS nodes located in Australia would crawl that data much faster than machines in Europe would.
Cost: AWS machines are paid for by the hour. Can you afford that? If not, better to use your own machines.
Current cluster capacity: does your current cluster have ample capacity and space to handle the amount of crawled data? I don't think there will be a problem in terms of computational speed, as Nutch runs on Hadoop, which was designed to run on commodity hardware. But can your cluster accommodate all the data fetched by the crawler?
Volume of data: what is a rough estimate of the data to be crawled? If it's small, it makes no sense to rent an AWS cluster (see the back-of-envelope sketch after this list).
Time constraints: is there any time bound for completing the crawl?
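As a rough Python sketch of the volume estimate (every figure below is an assumption to replace with your own crawl parameters):

    # All numbers are assumptions -- substitute your own crawl parameters.
    pages         = 10_000_000   # pages you plan to fetch
    avg_page_kb   = 60           # average fetched page size in KB
    hdfs_replicas = 3            # default HDFS replication factor

    raw_gb = pages * avg_page_kb / (1024 * 1024)
    print(f"raw crawl: ~{raw_gb:,.0f} GB; "
          f"on HDFS with replication: ~{raw_gb * hdfs_replicas:,.0f} GB")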
If you are doing this for a professional project, then these factors must be given some thought.
If you are doing it for fun/hobby/learning, go ahead and use the free-tier nodes of AWS. Those are low-capacity nodes that Amazon gives away free. It's fun to learn new things :)
Advantages of AWS:
No need to buy machines to set up a cluster; you can get started without any hardware except a terminal PC.
Locality
No need to look after the machines. If a node crashes badly, leave it (it's not your problem :P). Buy a new one, add it to the cluster, and move on.
Disadvantages of AWS:
Costly.
Copying data to any machine outside the AWS cluster is charged.
Your data is NOT persisted when you give up the procured AWS nodes. If you want to persist it, pay for and use the S3 storage service.
I'm currently looking for a cloud PaaS that will allow me to scale an application to handle anything between 1 user and 10 million+ users. I've never worked on anything this big, and the big question that I can't seem to get a clear answer to is this: if you develop, let's say, a standard application with a relational database and SOAP web services, will this application scale automatically when deployed on a PaaS solution, or do you still need to build the application with failover, redundancy and all those things in mind?
Let's say I deploy a Spring/Hibernate application to Amazon EC2 and create a single instance of Ubuntu Server with Tomcat installed. Will this application just scale indefinitely, or do I need more Ubuntu instances? If more than one instance is needed, does Amazon take care of running the application across both instances, or is that the developer's responsibility? What about database storage: can I install a database on EC2 that will scale as the database grows, or do I need to use one of their APIs instead if I want it to scale indefinitely?
CloudFoundry allows you to build locally and deploy straight to its PaaS, but since it's in beta, there's a limit on the amount of resources you can use, and databases are limited to 128 MB if I remember correctly, so this is a no-go for now. Some have suggested installing CloudFoundry on Amazon EC2; how does it scale, and how is the database layer handled then?
GAE (Google App Engine): will this allow me to just deploy an app and not have to worry about how it scales and implements redundancy? There appear to be some limitations on what you can and can't run on GAE, and their recent price increase upset quite a large number of developers. Is it really that expensive compared to other providers?
So basically, will it scale and what needs to be done to make it scale?
That's a lot of questions for one post. Anyway:
Amazon EC2 does not scale automatically with load. EC2 is basically just a virtual machine. You can achieve scaling of EC2 instances with Auto Scaling and Elastic Load Balancing.
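To give a flavor of what that setup involves, here is a minimal sketch using the boto3 Python SDK; the AMI ID, names and availability zone are placeholders, and a real deployment would add scaling policies and health checks on top:

    import boto3  # assumes AWS credentials are configured in the environment

    autoscaling = boto3.client('autoscaling')

    # Describe how new instances should look (placeholder AMI and names).
    autoscaling.create_launch_configuration(
        LaunchConfigurationName='web-lc',
        ImageId='ami-12345678',        # e.g. your Ubuntu + Tomcat image
        InstanceType='t1.micro',
    )

    # Keep between 1 and 4 such instances behind a pre-created load balancer.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName='web-asg',
        LaunchConfigurationName='web-lc',
        MinSize=1,
        MaxSize=4,
        AvailabilityZones=['us-east-1a'],
        LoadBalancerNames=['web-elb'],
    )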
SQL databases scale poorly. That's why people started using NoSQL databases in the first place. It's best to see which database your cloud provider offers as a managed service: Datastore on GAE and DynamoDB on Amazon.
Installing your own database on EC2 instances is impractical because the local instance storage is ephemeral (data on that "disk" is lost when the instance is stopped or fails), so you would need EBS volumes, or a managed service, to persist it.
GAE's Datastore is actually one big database shared by all applications running on the platform, so it's pretty scalable; your millions of users should not be a problem for it.
http://highscalability.com/blog/2011/1/11/google-megastore-3-billion-writes-and-20-billion-read-transa.html
Yes, App Engine scales automatically, both frontend instances and the database. There is nothing special you need to do to make it scale; just use their APIs.
There are limitations on what you can do with App Engine:
A. No local storage (filesystem): you need to use the Datastore or Blobstore.
B. Comet is only supported via their proprietary Channels API.
C. The Datastore is a NoSQL database: no JOINs, limited queries, limited transactions.
The cost of GAE is not bad. We serve 1M requests a day for about 5 dollars a day. The biggest saving comes from the fact that you do not need a system admin on GAE (but you do need one for EC2). Compared to the cost of manpower, GAE is incredibly cheap.
Some hints to save money on (and speed up) GAE:
A. Use get instead of query in the Datastore (this requires carefully crafted natural keys).
B. Use memcache to cache data you got from the Datastore; this can be done automatically with objectify and its @Cached annotation (see the sketch after this list).
C. Denormalize data, meaning you write data redundantly in various places in order to reach it in as few operations as possible.
D. If you have a lot of REST requests from devices where you do not use cookies, switch off session support (or roll your own, as we did). Sessions use the Datastore under the hood, and for every request they do a get and a put.
E. Read about adjusting app settings. Try different settings (depending on how tolerant your app is to request delay, and on your traffic patterns/spikes). We were able to cut our frontend instances down by 70%.
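The answer above is Java (objectify); as a rough Python analogue of hints A and B, here is the same pattern of fetching by key instead of querying, with memcache in front of the Datastore (the helper and its key handling are illustrative, using the classic db API):

    from google.appengine.api import memcache
    from google.appengine.ext import db

    def get_entity(encoded_key, ttl=300):
        """Fetch an entity by key, trying memcache before the Datastore."""
        entity = memcache.get(encoded_key)
        if entity is None:
            entity = db.get(db.Key(encoded_key))   # a single get, not a query
            memcache.set(encoded_key, entity, time=ttl)
        return entity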
I have an idea for a web application and I am currently researching different platforms. I am really interested in Google App Engine, but it looks like it works pretty well for certain application types while being less suitable for others (there are horror stories as well as success stories, e.g. Goodbye Google App Engine vs. Why we are really happy with Google App Engine).
There is also a similar negative story in this thread from a year ago, concluding that GAE was not ready as a commercial production platform: GAE as Production Platform. There are also other threads from 2009 discussing the data-select limit (1,000 rows) that has since been lifted.
My app will essentially perform some mathematical analysis based on data pulled from external data feeds (potentially a substantial amount of data). The pull would happen in real time only the first time data is downloaded for the specific item at hand; after that, the data is stored in and retrieved from the local database. There will also be some additional external data pulls at scheduled intervals.
Based on this brief description, should I even bother starting on GAE? In general, what are the rules of thumb for deciding whether GAE suits the problem at hand? Also, what are good examples of production apps that use GAE? It looks like the GAE App Gallery is not around anymore, but I would definitely appreciate any Web 2.0 app examples running on App Engine.
In your specific case I would double check these factors:
a. Is the mathematical analysis a long running CPU intensive job?
GAE is not designed for long-running, CPU-intensive computational jobs; they would lead to a high billing cost and would force you to design your application around some GAE limitations (10 minutes max per job, limited soft memory, CPU quota, etc.).
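A common way around the per-job time limit is to split the analysis into task-queue-sized chunks that re-queue themselves; a minimal Python sketch using App Engine's deferred library (load_rows and process are hypothetical stand-ins for your own analysis code):

    from google.appengine.ext import deferred

    def load_rows(dataset_id, offset, limit):
        return []   # hypothetical: fetch the next slice of input data

    def process(rows):
        pass        # hypothetical: run the mathematical analysis on one slice

    def analyze_chunk(dataset_id, offset=0, batch=500):
        """Process one slice, then re-queue the next slice as a new task,
        so no single task ever approaches the time limit."""
        rows = load_rows(dataset_id, offset, batch)
        if not rows:
            return  # finished
        process(rows)
        deferred.defer(analyze_chunk, dataset_id, offset + batch)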
b. Are you planning to retrieve external data using a mainstream API (twitter, yahoo, facebook)?
Your application shares the same pool of IP addresses with other applications; if the API you want to use does not allow authenticated requests, your application will suffer hiccups caused by throttling/quota-limit errors. I faced this problem here.
App Engine should work fine for your application. It's generally designed to serve, and to scale, sites that serve mostly user-facing traffic. Applications that it's not suitable for are things such as video transcoding, which rely heavily on backend processing, or things that have to shell out to native code, such as 3D graphics, etcetera.
It depends on what type of mathematical analysis you are doing. If your application is heavy on I/O, I would give it some pause. On GAE, you're rather limited in your I/O options. You basically have the following:
RAM: I can't recall exactly, but GAE imposes a hard limit of around 200 MB of RAM.
Datastore: you get plenty of space here, but it's slow compared to a cached local filesystem.
Memcache: faster than the Datastore, but not nearly as fast as a cached disk. And worse, it's a cache, so there's no guarantee it won't get wiped out.
External sources: these include calls to external web pages. Lots of flexibility, but very slow.
In sum, I would look at other options if you're doing heavy I/O on a medium-size dataset (>20 MB and roughly <2 GB). These are probably non-issues for 90% of web apps, but you should be aware of them.
All the negatives aside, working on GAE is a joyous experience. You spend more time programming and less time configuring. And it's really cheap.
I am learning about the Apache Cassandra database.
Does anyone have any good/bad experiences with deploying Cassandra to less than dedicated hardware like the offerings of Linode or Slicehost?
I think Cassandra would be a great way to scale a web service easily to meet read/write/request load: just add another Linode running a Cassandra node to the existing cluster. Yes, this implies running the public web service and a Cassandra node on the same VPS (to which many may take exception).
Pros of Linode-like deployment for Cassandra:
Private VLAN; the Cassandra nodes could communicate privately
An API to provision a new Linode (and perhaps configure it with a "StackScript" that installs Cassandra and its dependencies, etc.)
The price is right
Cons:
Each host is a VPS and is not dedicated of course
The RAM/cost ratio is not that great once you decide you want 4 GB of RAM (compare dedicated hardware at, say, SoftLayer)
Only one disk, where one would prefer two, I suppose (one for the commit log and another for the data files themselves). Probably moot since this is shared hardware anyway.
EDIT: found this which helps a bit: http://wiki.apache.org/cassandra/CassandraHardware
I see that 1 GB is the minimum, but is that a recommendation? Could I deploy on a Linode 720, for instance (with say 500 MB usable for Cassandra)? See http://www.linode.com/
How much RAM you need really depends on your workload: if you are write-mostly you can get away with less; otherwise you will want RAM for the read cache.
You do get more RAM for your money at my employer, Rackspace Cloud: http://www.rackspacecloud.com/cloud_hosting_products/servers/pricing. (Our machines also have RAIDed disks, so people typically see better I/O performance vs. EC2. I don't know about Linode.)
Since with most VPSes you pay roughly 2x for the next-size instance, i.e. about the same as adding a second small instance, I would recommend going with fewer, larger instances rather than more, smaller ones, since at small cluster sizes the network overhead is not negligible.
I do know someone using Cassandra on 256 MB VMs, but you're definitely in the minority if you go that small.