Before starting GAE datastore, I think it will be good to know the difference b/w Master/Slave Datastore and High Replication Datastore?
And what makes GAE team to migrate from Master/Slave to HRD?
The difference between the two (as well as the reason for the switch) is increased fault-tolerance and data consistency.
The Master/Slave Datastore implements a primary-backup protocol. Each app is served by a master (i.e. a single data center) and its data is replicated asynchronously to the slave (i.e. some other data center). The problem with this schema is that it doesn't protect your application from local failures and is more likely to lead to data inconsistencies.
The High Replication Datastore implements the Paxos consensus algorithm to ensure that a majority of data centers maintain a consistent view of your application's data. Because your data is no longer reliant on the health of a single data center, the datastore is able to function properly even in the presence of local/global failures. Google's engineers also benefit from this implementation, as it allows them to perform data center maintenance without having to enforce scheduled read-only periods for AppEngine applications.
The downside of using the High Replication Datastore is slower writes (about 2x as slow, since Paxos is inherently 2-phase). This isn't that big of a deal though, especially when compared to the increased fault tolerance and data consistency that the High Replication Datastore has to offer.
For the first three years of App Engine only with Master/Slave, the health of the datastore was tied to the health of a single data center. Users had low latency and strong consistency, but also transient data unavailability and planned read-only periods.
The High Replication Datastore trades small amounts of latency and consistency for significantly higher availability.
Master/Slave store is deprecated, it's advised that you do not use it, https://developers.google.com/appengine/docs/python/datastore/usingmasterslave
Related
Using AppEngine with Python and the HRD retrieving records sequentially (via an indexed field which is an incrementing integer timestamp) we get 15,000 records returned in 30-45 seconds. (Batching and limiting is used.) I did experiment with doing queries on two instances in parallel but still achieved the same overall throughput.
Is there a way to improve this overall number without changing any code? I'm hoping we can just pay some more and get better database throughput. (You can pay more for bigger frontends but that didn't affect database throughput.)
We will be changing our code to store multiple underlying data items in one database record, but hopefully there is a short term workaround.
Edit: These are log records being downloaded to another system. We will fix it in the future and know how to do so, but I'd rather work on more important things first.
Try splitting the records on different entity groups. That might force them to go to different physical servers. Read entity groups in parallel from multiple threads or instances.
Using cache mght not work well for large tables.
Maybe you can cache your records, like use Memcache:
https://developers.google.com/appengine/docs/python/memcache/
This could definitely speed up your application access. I don't think that App Engine Datastore is designed for speed but for scalability. Memcache however is.
BTW, if you are conscious about the performance that GAE gives as per what you pay, then maybe you can try setting up your own App Engine cloud with:
AppScale
JBoss CapeDwarf
Both have an active community support. I'm using CapeDwarf in my local environment it is still in BETA but it works.
Move to any of the in-memory databases. If you have Oracle Database, using TimesTen will improve the throughput multifold.
I am using Google App Engine for an app, and the app is currently hitting the datastore at a rate of around 2.5 million row writes, and 4.5 million row reads per day.
I am currently porting the app to Amazon Elastic Beanstalk and Amazon RDS due to the very high costs of running an application on GAE.
Based on the values above, how can I find out / estimate what type of RDS instance I will need for my requirements? Is the above a considerable amount of processing for, lets say a Small or Micro MySQL RDS instance to process in a day?
Totally depends on a number of factors:
Row size.
Field types and sizes.
Complexity of your queries (joins, etc).
Proper use of indexes.
Row contention and other possible bottlenecks.
Really hard to tell. But from experience, if you don't need fancy replication or sharding, the costs of the GAE datastore are usually higher as it offers total redundancy, distribution, scalability, etc.
My suggestion would be to write a quick program to benchmark a load on RDS that replicates what you are expecting. Should be easy to write if you forgo all the business rules and such and just do fake but randomized reads and writes.
I currently develop an application hosted at google app engine. However, gae has many disadvantages: it's expensive and is very hard to debug since we can't attach to real instances.
I am considering changing the gae to an open source alternative. Unfortunately, none of the existing NOSQL solutions which satisfy me support transactions similar to gae's transactions (gae support transactions inside of entity groups).
What do you think about solving this problem? I am currently considering a store like Apache Cassandra + some locking service (hazelcast) for transactions. Did anyone has any experience in this area? What can you recommend
There are plans to support entity groups in cassandra in the future, see CASSANDRA-1684.
If your data can't be easily modelled without transactions, is it worth using a non transcational database? Do you need the scalability?
The standard way to do transaction like things in cassandra is described in this presentation, starting at slide 24. Basically you write something similar to a WAL log entry to 1 row, then perform the actual writes on multiple rows, then delete the WAL log row. On failure, simply read and perform actions in the WAL log. Since all cassandra writes have a user supplied time stamp, all writes can be made idempotent, just store the time stamp of your write with the WAL log entry.
This strategy gives you the Atomic and Durable in ACID, but you do not get Consistency and Isolation. If you are working at scale that requires something like cassandra, you probably need to give up full ACID transactions anyway.
You may want to try AppScale or TyphoonAE for hosting applications built for App Engine on your own hardware.
If you are developing under Python, you have very interesting debugging options with the Werkzeug debugger.
Google start to use The High Replication datastore (HRD) as the default for new applications.
HR from the docs:
The HRD is a highly available, highly
reliable storage solution. It remains
available for reads and writes during
planned downtime and is extremely
resilient in the face of catastrophic
failure—but it costs more than the
master/slave option.
M/S from the docs:
your data may be temporarily
unavailable during data center issues
or planned downtime
Now, have you ever expirienced downtime? If this "downtime disclaimer" is just something theorical and doesn't happen frecuently I would use the M/S becouse it's cheaper.
What are the numbers that Google handle to say "downtime"? maybe their downtime is just a few seconds in a year, something totaly acceptable for some kind of apps.
Would love answers from experienced AppEngine developers.
I would recommend you to use HRD as Google said they will make M/S more expensive than HRD until the end of the year and even remove the M/S option as they are looking to "force" the businesses & developers to take advantage of all HRD goodies. The real reason is that maintaining a single type of infrastructure is cheaper than to maintain the both HRD and M/S so Google picks HRD.
Source : Google I/O 2011
Downtime isn't theoretical - it happens in any distributed system. There are two types, roughly speaking: localized and global. Localized issues occur when a particular machine has trouble and can't serve requests; global downtime happens when something happens to the service as a whole.
Both can occur on App Engine: the former due to localized hardware failure, and the latter generally only due to planned maintenance that requires setting the master-slave datastore read-only for a brief period. The HR datastore handles both more robustly than the MS datastore, and doesn't require a read-only period during maintenance windows.
Once the new pricing scheme comes into effect, both datastores will be charged at the same rate.
For these and many other reasons, you should always use the HR datastore in new apps.
If I have a poll application on GAE being simultaneously updated across several continents, given the app has been replicated across Google infrastructure, would the data store keep accurate count? Do I need any design consideration for such application?
Applications aren't actually replicated across Google's infrastructure worldwide. If you're using the Master-Slave datastore (the default until very recently), everything you do is strongly consistent, and your reads are all served from a single datacenter (with data replicated to another datacenter as a backup, but not to serve requests ordinarily). With the HR datastore, you do get eventual consistency outside of transactions, but I believe all of the data is in North America and the latency isn't anywhere near what you might expect if the data was being stored on different continents (and, in any case, you can use transactions).