Google App Engine and data replication with multiple instances

Google App Engine and data replication with multiple instances - google-app-engine

I'm using the Google App Engine as the backend for an iOS game that was just released.
Through that act of playing the game, players create levels and then those are shared with their friends and the world at large. GAE is used to store and retrieve those levels. GAE also Manages player's high scores since they are more complex than Game Center can handle.
As a whole, GAE works great. I like how GAE spins up new instances as they are needed without me having to constantly monitor load. For this game, GAE is running around 10 instance and serving around 8 queries a second.
But there is a small problem.
I've noticed that sometimes players will get on the high score table twice. This should not be possible since I remove any old scores before putting up the new scores (this is done in one query to GAE).
After some testing and poking around, it seems that what is happening is that a player will get a high score and instance 1 handles the removing of the old score and the adding of the new one. The player then gets a new high score, but this time instance 4 is the one that handles the request and it doesn't know about the other score yet.
At their fastest, it might take a player 10 seconds to get a new high score. It was my understanding that the replication of data only took 2 or 3 seconds.
I never saw this problem during testing because load rarely caused 2 instances to be started.
Does this seem like a plausible explanation for what is happening and how data is stored for each instance?
Is there a way to guarantee that data added, deleted or altered in one instance will be available in another? High scores are not "mission critical", so I'm not too worried about it, but I would like to use GAE for some more complex situations where it is very important that data is consistent.
Is that possible with GAE, or should I be looking at other solutions?

It is possible to guarantee that data will be consistent across all data centers (strong consistency). You need to use ancestor queries to achieve it. However, doing so poses a restriction on how many write per seconds you can achieve. Currently the limit is 1 write per second.
If the write limit is too slow for you, one alternative is to add a cache layer. So you will still be using the eventual consistency model, but you will mix those results with the ones in memcache.
See the doc Structuring for Strong Consistency for further details.

Related

Should I set max limits on as many entities as possible in my webapp?

I am working on a CRUD webapp where users create and manager their data assets.
Having no desire to be bombarded with tons of data for the first time, I think that it would be reasonable to set limits where possible in the database.
For example, limit number of created items A = 10, B = 20, C = 50 then, if user reaches the limit, have a look at his account and figure out if I should update the rules if it doesn't break the code and performance.
Is it a good practice at all to set such limits from the performance/maintenance side, not from the business side or should I think like data entities are unlimited and try to make it well-performing with lots of data from the start?

You suggest to test your application's performance on real users, which is bad. In addition, your solution will create inconvenience for users by limiting them, when there is no reason for that (at least from user's point of view), which decreases user's satisfaction.
Instead, you should test performance before you release. It will give you understanding of your application's and infrastructure's limits of running under high load. Also, it will help you to find and eliminate bottle necks in your code. You can perform such testing with tools like JMeter and many others.
Also, if you afraid of tons of data at start moment, you can release your application as private beta: just make a simple form where users can ask for early access and receive invite. By sending invites you can easily control growth of user base and therefore loading on you app.
But you should, of course, create limitations where it is necessary, for example, limit items per page, etc.

How to Gain Visibility and Optimize Quota Usage in Google App Engine?

How do I go about optimizing my Google App Engine app to reduce instance hours I am currently using/paying for?
I have been using app engine for a while and the cost has been creeping upwards. I now spend enough on GAE to invest time into reducing the expense. More than half of my GAE bill is due to frontend instance hours, so it's the obvious place to start. But before I can start optimizing, I have to figure out what's using the instance hours.
However, I am having difficulty trying to determine what is currently using so many of my frontend instance hours. My app serves many ajax requests, dynamic HTML pages, cron jobs, and deferred tasks. For all I know there could be some runaway process that is causing my instance usage to be so high.
What methods or techniques are available to allow me to gain visibility into my app to see where I am using instance hours?

Besides code changes (all suggestions in the other answer are good) you need to look into the instances over time graph.
If you have spikes and constant use, the instances created during the spikes wont go to sleep because appengine will keep using them. In appspot application settings, change the "idle instances" max to a low number like 1 (or your actual daily average).
Also, change min latency to a higher number so less instances will be created on spikes.
All these suggestions can make an immediate effect on lowering your bill, but its just a complement to the code optimizations suggested in the other answer.

This is a very broad question, but I will offer a few pointers.
First, examine App Engine's console Dashboard and logs. See if there are any errors. Errors are expensive both in terms of lost business and in extra instance hours. For example, tasks are retried several times, and these reties may easily prolong the life of an instance beyond what is necessary.
Second, the Dashboard shows you the summary of your requests over 24 hours period. Look for requests with high latency. See if you can improve them. This will both improve the user experience and may reduce the number of instance hours as more requests can be handled by each instance.
Also look for data points that surprise you as a developer of your app. If you see a request that is called many more times that you think is normal, zero in on it and see what it is happening.
Third, look at queues execution rates. When you add multiple tasks to a queue, do you really need all of them to be executed within seconds? If not, reduce the execution rate so that the queue never needs more than one instance.
Fourth, examine your cron jobs. If you can reduce their frequency, you can save a bunch of instance hours. If your cron jobs must run frequently and do a lot of computing, consider moving them to a Compute Engine instance. Compute Engine instances are many times cheaper, so having such an instance run for 24 hours may be a better option than hitting an App Engine instance every 15 minutes (or even every hour).
Fifth, make sure your app is thread-safe, and your App Engine configuration states so.
Finally, do the things that all web developers do (or should do) to improve their apps/websites. Cache what can be cached. Minify what needs to be minified. Put images in sprites. Split you code if it can be split. Use Memcache. Etc. All of these steps reduce latency and/or client-server roundtrips, which helps to reduce the number of instances for the same number of users.

Ok, my other answer was about optimizing at the settings level.
To trace the performance at a granular level use the new cloud trace relased today at google i/o 2014.
http://googledevelopers.blogspot.com/2014/06/cloud-platform-at-google-io-enabling.html

Strategy for caching of remote service; what should I be considering?

My web app contains data gathered from an external API of which I do not have control. I'm limited to about 20,000 API requests per hour. I have about 250,000 items in my database. Each of these items is essentially a cached version. Consider that it takes 1 request to update the cache of 1 item. Obviously, it is not possible to have a perfectly up-to-date cache under these circumstances. So, what things should I be considering when developing a strategy for caching the data. These are the things that come to mind, but I'm hoping someone has some good ideas I haven't thought of.
time since item was created (less time means more important)
number of 'likes' a particular item has (could mean higher probability of being viewed)
time since last updated
A few more details: the items are photos. Every photo belongs to an event. Events that are currently occurring are more like to be viewed by client (therefore they should take priority). Though I only have 250K items in database now, that number increases rather rapidly (it will not be long until 1 million mark is reached, maybe 5 months).

Would http://instagram.com/developer/realtime/ be any use? It appears that Instagram is willing to POST to your server when there's new (and maybe updated?) images for you to check out. Would that do the trick?
Otherwise, I think your problem sounds much like the problem any search engine has—have you seen Wikipedia on crawler selection criteria? You're dealing with many of the problems faced by web crawlers: what to crawl, how often to crawl it, and how to avoid making too many requests to an individual site. You might also look at open-source crawlers (on the same page) for code and algorithms you might be able to study.
Anyway, to throw out some thoughts on standards for crawling:
Update the things that have changed often when updated. So, if an item hasn't changed in the last five updates, then maybe you could assume it won't change as often and update it less.
Create a score for each image, and update the ones with the highest scores. Or the lowest scores (depending on what kind of score you're using). This is a similar thought to what is used by LilyPond to typeset music. Some ways to create input for such a score:
A statistical model of the chance of an image being updated and needing to be recached.
An importance score for each image, using things like the recency of the image, or the currency of its event.
Update things that are being viewed frequently.
Update things that have many views.
Does time affect the probability that an image will be updated? You mentioned that newer images are more important, but what about the probability of changes on older ones? Slow down the frequency of checks of older images.
Allocate part of your requests to slowly updating everything, and split up other parts to process results from several different algorithms simultaneously. So, for example, have the following (numbers are for show/example only--I just pulled them out of a hat):
5,000 requests per hour churning through the complete contents of the database (provided they've not been updated since the last time that crawler came through)
2,500 requests processing new images (which you mentioned are more important)
2,500 requests processing images of current events
2,500 requests processing images that are in the top 15,000 most viewed (as long as there has been a change in the last 5 checks of that image, otherwise, check it on a decreasing schedule)
2,500 requests processing images that have been viewed at least
Total: 15,000 requests per hour.

How many (unique) photos / events are viewed on your site per hour? Those photos that are not viewed probably don't need to be updated often. Do you see any patterns in views for old events / phones? Old events might not be as popular so perhaps they don't have to be checked that often.
andyg0808 has good detailed information however it is important to know the patterns of your data usage before applying in practice.
At some point you will find that 20,000 API requests per hour will not be enough to update frequently viewed photos, which might lead you to different questions as well.

Is a real-time multiplayer game using Google App Engine feasible?

I am currently developing a real-time multiplayer game, and have been evaluating various cloud-based hosting solutions. I am unsure whether App Engine fits my needs, and would be grateful for any feedback.
In essence, I want the system to work like this: Player A calculates round n, and generates a hash out of the game state at the end of that round. He then sends his commands for that round, and the hash, as a http POST to the server. Player B does the same thing, in parallel.
The server, while handling the POST from a player, first writes the received hash code to the memcache. If the hash from the other player is not yet in the memcache, it waits and periodically checks the memcache for the other players hash. As soon as both hashes are in the memcache, it compares them for equality. If they are equal, the server sends the commands of each player to the respectively other one as the http response.
A round like that should last around half a second, meaning two requests per player per second.
Of course, this way of doing it will only work if there are at least two instances of the application running, as two requests must be dealt with in parallel. Also, the memory cache must be consistent over all instances, be fairly reliable, and update immediately.
I cannot use XMPP because I want my game to be able to run within restricted networks, so it has to be limited to http on port 80.
Is there a way to enforce that two instances of the app are always running? Are there glaringly obvious flaws in my design? Do you think an architecture like this might work on App Engine? If not, what cloud based solution would you suggest?

I believe this could work. The key API for you to learn about / test would probably be the Channel API. That is what would allow back and forth communication between the client and server.
The next issue to worry about would be memcache. In general, it is reliable, but in the strictest sense we are supposed to assume that memcached data could disappear at any time.
If you decide that you can't risk losing the data like that, then you need to persist it in the datastore, which means you will have to experiment to make sure you can sustain 2 moves per turn. I think this is possible, but not trivially so. If you had said 1 move every 3 seconds I would say "no problem." But multiple updates to one entity per second start to bump up against the practical limit on writes per second, especially if they are transactional.
Having multiple instances running will not be a problem - you can pay to keep instances warm if necessary.

How to gear towards scalability for a start up e-commerce portal?

I want to scale an e-commerce portal based on LAMP. Recently we've seen huge traffic surge.
What would be steps (please mention in order) in scaling it:
Should I consider moving onto Amazon EC2 or similar? what could be potential problems in switching servers?
Do we need to redesign database? I read, Facebook switched to Cassandra from MySql. What kind of code changes are required if switched to Cassandra? Would Cassandra be better option than MySql?
Possibility of Hadoop, not even sure?
Any other things, which need to be thought of?
Found this post helpful. This blog has nice articles as well. What I want to know is list of steps I should consider in scaling this app.

First, I would suggest making sure every resource served by your server sets appropriate cache control headers. The goal is to make sure truly dynamic content gets served fresh every time and any stable or static content gets served from somebody else's cache as much as possible. Why deliver a product image to every AOL customer when you can deliver it to the first and let AOL deliver it to all the others?
If you currently run your webserver and dbms on the same box, you can look into moving the dbms onto a dedicated database server.
Once you have done the above, you need to start measuring the specifics. What resource will hit its capacity first?
For example, if the webserver is running at or near capacity while the database server sits mostly idle, it makes no sense to switch databases or to implement replication etc.
If the webserver sits mostly idle while the dbms chugs away constantly, it makes no sense to look into switching to a cluster of load-balanced webservers.
Take care of the simple things first.
If the dbms is the likely bottle-neck, make sure your database has the right indexes so that it gets fast access times during lookup and doesn't waste unnecessary time during updates. Make sure the dbms logs to a different physical medium from the tables themselves. Make sure the application isn't issuing any wasteful queries etc. Make sure you do not run any expensive analytical queries against your transactional database.
If the webserver is the likely bottle-neck, profile it to see where it spends most of its time and reduce the work by changing your application or implementing new caching strategies etc. Make sure you are not doing anything that will prevent you from moving from a single server to multiple servers with a load balancer.
If you have taken care of the above, you will be much better prepared for making the move to multiple webservers or database servers. You will be much better informed for deciding whether to scale your database with replication or to switch to a completely different data model etc.

1) First thing - measure how many requests per second can serve you most-visited pages. For well-written PHP sites on average hardware it must be in 200-400 requests per second range. If you are not there - you have to optimize the code by reducing number of database requests, caching rarely changed data in memcached/shared memory, using PHP accelerator. If you are at some 10-20 requests per second, you need to get rid of your bulky framework.
2) Second - if you are still on Apache2, you have to switch to lighthttpd or nginx+apache2. Personally, I like the second option.
3) Then you move all your static data to separate server or CDN. Make sure it is served with "expires" headers, at least 24 hours.
4) Only after all these things you might start thinking about going to EC2/Hadoop, build multiple servers and balancing the load (nginx would also help you there)
After steps 1-3 you should be able to serve some 10'000'000 hits per day easily.
If you need just 1.5-3 times more, I would go for single more powerfull server (8-16 cores, lots of RAM for caching & database).
With step 4 and multiple servers you are on your way to 0.1-1billion hits per day (but for significantly larger hardware & support expenses).

Find out where issues are happening (or are likely to happen if you don't have them now). Knowing what is your biggest resource usage is important when evaluating any solution. Stick to solutions that will give you the biggest improvement.
Consider:
- higher than needed bandwidth use x user is something you want to address regardless of moving to ec2. It will cost you money either way, so its worth a shot at looking at things like this: http://developer.yahoo.com/yslow/
- don't invest into changing databases if that's a non issue. Find out first if that's really the problem, and even if you are having issues with the database it might be a code issue i.e. hitting the database lots of times per request.
- unless we are talking about v. big numbers, you shouldn't have high cpu usage issues, if you do find out where they are happening / optimization is worth it where specific code has a high impact in your overall resource usage.
- after making sure the above is reasonable, you might get big improvements with caching. In bandwith (making sure browsers/proxy can play their part on caching), local resources usage (avoiding re-processing/re-retrieving the same info all the time).
I'm not saying you should go all out with the above, just enough to make sure you won't get the same issues elsewhere in v. few months. Also enough to find out where are your biggest gains, and if you will get enough value from any scaling options. This will also allow you to come back and ask questions about specific problems, and how these scaling options relate to those.

You should prepare by choosing a flexible framework and be sure things are going to change along the way. In some situations it's difficult to predict your user's behavior.
If you have seen an explosion of traffic recently, analyze what are the slowest pages.
You can move to cloud, but EC2 is not the best performing one. Again, be sure there's no other optimization you can do.
Database might be redesigned, but I doubt all of it. Again, see the problem points.
Both Hadoop and Cassandra are pretty nifty, but they might be overkill.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight