How does Google App Engine data store scale with Polls - google-app-engine

If I have a poll application on GAE being simultaneously updated across several continents, given the app has been replicated across Google infrastructure, would the data store keep accurate count? Do I need any design consideration for such application?

Applications aren't actually replicated across Google's infrastructure worldwide. If you're using the Master-Slave datastore (the default until very recently), everything you do is strongly consistent, and your reads are all served from a single datacenter (with data replicated to another datacenter as a backup, but not to serve requests ordinarily). With the HR datastore, you do get eventual consistency outside of transactions, but I believe all of the data is in North America and the latency isn't anywhere near what you might expect if the data was being stored on different continents (and, in any case, you can use transactions).

Related

Master/Slave Datastore vs High Replication Datastore

Before starting GAE datastore, I think it will be good to know the difference b/w Master/Slave Datastore and High Replication Datastore?
And what makes GAE team to migrate from Master/Slave to HRD?
The difference between the two (as well as the reason for the switch) is increased fault-tolerance and data consistency.
The Master/Slave Datastore implements a primary-backup protocol. Each app is served by a master (i.e. a single data center) and its data is replicated asynchronously to the slave (i.e. some other data center). The problem with this schema is that it doesn't protect your application from local failures and is more likely to lead to data inconsistencies.
The High Replication Datastore implements the Paxos consensus algorithm to ensure that a majority of data centers maintain a consistent view of your application's data. Because your data is no longer reliant on the health of a single data center, the datastore is able to function properly even in the presence of local/global failures. Google's engineers also benefit from this implementation, as it allows them to perform data center maintenance without having to enforce scheduled read-only periods for AppEngine applications.
The downside of using the High Replication Datastore is slower writes (about 2x as slow, since Paxos is inherently 2-phase). This isn't that big of a deal though, especially when compared to the increased fault tolerance and data consistency that the High Replication Datastore has to offer.
For the first three years of App Engine only with Master/Slave, the health of the datastore was tied to the health of a single data center. Users had low latency and strong consistency, but also transient data unavailability and planned read-only periods.
The High Replication Datastore trades small amounts of latency and consistency for significantly higher availability.
Master/Slave store is deprecated, it's advised that you do not use it, https://developers.google.com/appengine/docs/python/datastore/usingmasterslave

Will using a Cloud PaaS automatically solve scalability issues?

I'm currently looking for a Cloud PaaS that will allow me to scale an application to handle anything between 1 user and 10 Million+ users ... I've never worked on anything this big and the big question that I can't seem to get a clear answer for is that if you develop, let's say a standard application with a relational database and soap-webservices, will this application scale automatically when deployed on a Paas solution or do you still need to build the application with fall-over, redundancy and all those things in mind?
Let's say I deploy a Spring Hibernate application to Amazon EC2 and I create single instance of Ubuntu Server with Tomcat installed, will this application just scale indefinitely or do I need more Ubuntu instances? If more than one Ubuntu instance is needed, does Amazon take care of running the application over both instances or is this the developer's responsibility? What about database storage, can I install a database on EC2 that will scale as the database grow or do I need to use one of their APIs instead if I want it to scale indefinitely?
CloudFoundry allows you to build locally and just deploy straight to their PaaS, but since it's in beta, there's a limit on the amount of resources you can use and databases are limited to 128MB if I remember correctly, so this a no-go for now. Some have suggested installing CloudFoundry on Amazon EC2, how does it scale and how is the database layer handled then?
GAE (Google App Engine), will this allow me to just deploy an app and not have to worry about how it scales and implements redundancy? There appears to be some limitations one what you can and can't run on GAE and their price increase recently upset quite a large number of developers, is it really that expensive compared to other providers?
So basically, will it scale and what needs to be done to make it scale?
That's a lot of questions for one post. Anyway:
Amazon EC2 does not scale automatically with load. EC2 is basically just a virtual machine. You can achieve scaling of EC2 instances with Auto Scaling and Elastic Load Balancing.
SQL databases scale poorly. That's why people started using NoSQL databases in the first place. It's best to see which database your cloud provider offers as a managed service: Datastore on GAE and DynamoDB on Amazon.
Installing your own database on EC2 instances is very impractical as EC2 has ephemeral storage (it looses all data on "disk" when it reboots).
GAE Datastore is actually a one big database for all applications running on it. So it's pretty scalable - your million of users should not be a problem for it.
http://highscalability.com/blog/2011/1/11/google-megastore-3-billion-writes-and-20-billion-read-transa.html
Yes App Engine scales automatically, both frontend instances and database. There is nothing special you need to do to make it scale, just use their API.
There are limitations what you can do with AppEngine:
A. No local storage (filesystem) - you need to use Datastore or Blobstore.
B. Comet is only supported via their proprietary Channels API
C. Datastore is a NoSQL database: no JOINs, limited queries, limited transactions.
Cost of GAE is not bad. We do 1M requests a day for about 5 dollars a day. The biggest saving comes from the fact that you do not need a system admin on GAE ( but you do need one for EC2). Compared to the cost of manpower GAE is incredibly cheap.
Some hints to save money (an speed up) GAE:
A. Use get instead of query in Datastore (requires carefully crafting natiral keys).
B. Use memcache to cache data you got form datastore. This can be done automatically with objectify and it's #Cached annotation.
C. Denormalize data. Meaning you write data redundantly in various places in order to get to it in as few operations as possible.
D. If you have a lot of REST requests from devices, where you do not use cookies, then switch off session support ( or roll your own as we did). Sessions use datastore under the hood and for every request it does get and put.
E. Read about adjusting app settings. Try different settings (depending how tolerant your app is to request delay and your traffic patterns/spikes). We were able to cut down frontend instances by 70%.

Distributed store with transactions

I currently develop an application hosted at google app engine. However, gae has many disadvantages: it's expensive and is very hard to debug since we can't attach to real instances.
I am considering changing the gae to an open source alternative. Unfortunately, none of the existing NOSQL solutions which satisfy me support transactions similar to gae's transactions (gae support transactions inside of entity groups).
What do you think about solving this problem? I am currently considering a store like Apache Cassandra + some locking service (hazelcast) for transactions. Did anyone has any experience in this area? What can you recommend
There are plans to support entity groups in cassandra in the future, see CASSANDRA-1684.
If your data can't be easily modelled without transactions, is it worth using a non transcational database? Do you need the scalability?
The standard way to do transaction like things in cassandra is described in this presentation, starting at slide 24. Basically you write something similar to a WAL log entry to 1 row, then perform the actual writes on multiple rows, then delete the WAL log row. On failure, simply read and perform actions in the WAL log. Since all cassandra writes have a user supplied time stamp, all writes can be made idempotent, just store the time stamp of your write with the WAL log entry.
This strategy gives you the Atomic and Durable in ACID, but you do not get Consistency and Isolation. If you are working at scale that requires something like cassandra, you probably need to give up full ACID transactions anyway.
You may want to try AppScale or TyphoonAE for hosting applications built for App Engine on your own hardware.
If you are developing under Python, you have very interesting debugging options with the Werkzeug debugger.

Caching in Google App Engine/Cloud Based Hosting

I am curious as to how caching works in Google App Engine or any cloud based application. Since there is no guarantee that requests are sent to same sever, does that mean that if data is cached on 1st request on Server A, then on 2nd requests which is processed by Server B, it will not be able to access the cache?
If thats the case (cache only local to server), won't it be unlikely (depending on number of users) that a request uses the cache? eg. Google probably has thousands of servers
With App Engine you cache using memcached. This means that a cache server will hold the data in memory (rather than each application server). The application servers (for a given application) all talk the same cache server (conceptually, there could be sharding or replication going on under the hoods).
In-memory caching on the application server itself will potentially not be very effective, because there is more than one of those (although for your given application there are only a few instances active, it is not spread out over all of Google's servers), and also because Google is free to shut them down all the time (which is a real problem for Java apps that take some time to boot up again, so now you can pay to keep idle instances alive).
In addition to these performance/effectiveness issues, in-memory caching on the application server could lead to consistency problems (every refresh shows different data when the caches are not in sync).
Depends on the type of caching you want to achieve.
Caching on the application server itself can be interesting if you have complex in-memory object structure that takes time to rebuild from data loaded from the database. In that specific case, you may want to cache the result of the computation. It will be faster to use a local cache than a shared memcache to load if the structure is large.
If having consistent value between in-memory and the database is paramount, you can do some checksum/timestamp check with a stored value on the datastore, every time you use the cached value. Storing checksum/timestamp on a small object or in a global cache will fasten the process.
One big issue using global memcache is ensuring proper synchronization on "refilling" it, when a value is not yet present or has been flushed. If you have multiple servers doing the check at the exact same time and refilling value in cache, you may end-up having several distinct servers doing the refill at the same time. If the operation is idem-potent, this is not a problem; if not, a potential and very hard to trace bug.

Does memcached share across servers in google app engine?

On the memcached website it says that memcached is a distributed memory cache. It implies that it can run across multiple servers and maintain some sort of consistency. When I make a request in google app engine, there is a high probability that request in the same entity group will be serviced by the same server.
My question is, say there were two servers servicing my request, is the view of memcached from these two servers the same? That is, do things I put in memcached in one server reflected in the memcached instance for the other server, or are these two completely separate memcached instances (one for each server)?
Specifically, I want each server to actually run its own instance of memcached (no replication in other memcached instances). If it is the case that these two memcached instances update one another concerning changes made to them, is there a way to disable this?
I apologize if these questions are stupid, as I just started reading about it, but these are initial questions I have run into. Thanks.
App Engine does not really use memcached, but rather an API-compatible reimplementation (chiefly by the same guy, I believe -- in his "20% time";-).
Cached values may disappear at any time (via explicit expiration, a crash in one server, or due to memory scarcity in which case they're evicted in least-recently-used order, etc), but if they don't disappear they are consistent when viewed by different servers.
The memcached server chosen doesn't depend on the entity group that you're using (the entity group is a concept from the datastore, a different beast).
Each server runs its own instance of memcached, and each server will store a percentage of the objects that you store in memcache. The way it works is that when you use the memcached API to store something under a given key, a memcached server is chosen (based on the key).
There is no replication between memcached instances, if one of those boxes goes down, you lose 1/N of your memcached' data (N being the number of memcached instances running in AppEngine).
Typically, memcached does not share data between servers. The application server hashes the key to choose a memcached server, and then communicates with that server to get or set the data.
Based in what I know, there is only ONE instance of Memcache of you entire application, there could be many instance of your code running each one with their memory, and many datastore around the world, but there is only one Memcache server at a time, and keep in mind that this susceptible to failure service, even is no SLA for it.

Resources