What are the ways to decrease GAE CPU% usage? - google-app-engine

What did you do to make sure the CPU% is low?
Any sample code to look at?
I ask because every datastore read/query seems to push the CPU% beyond 100%, and I get the yellow and red highlights in my dashboard. I've read elsewhere that this is normal, but surely something can be done about it.

Use Appstats to get more detail on any long-running tasks. It does a good job of breaking down exactly how the CPU time is spent and lets you drill down into individual calls and view the stack to narrow down which command is running long.
URLFetch calls and datastore calls tend to be expensive. As Sam suggests, both can be memcached for very significant savings.
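On the Python runtime, wiring Appstats in is just a WSGI middleware in appengine_config.py; a minimal sketch of the standard setup (the /_ah/stats viewer also needs the appstats builtin enabled in app.yaml):

# appengine_config.py -- records RPC timings for every request
from google.appengine.ext.appstats import recording

def webapp_add_wsgi_middleware(app):
    return recording.appstats_wsgi_middleware(app)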

You profile your code and improve its efficiency.

Datastore operations are expensive. Try to reduce their usage with the help of memcache.
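A minimal read-through caching sketch, assuming a hypothetical Item model on the old db API:

from google.appengine.api import memcache
from google.appengine.ext import db

class Item(db.Model):  # hypothetical example model
    name = db.StringProperty()

def get_item(key_name):
    # Check memcache first; only fall back to the datastore on a miss
    item = memcache.get(key_name)
    if item is None:
        item = Item.get_by_key_name(key_name)  # the expensive datastore read
        if item is not None:
            memcache.set(key_name, item, time=3600)  # cache for an hour
    return item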

Is your app restarting a lot?
I notice that even a very minimal app takes over a second to load when it has been inactive for a while, which brings up a warning marker in the log.

For pages you can cache, you can set a Cache-Control header in your request handler:
self.response.headers["Cache-Control"] = "public,max-age=%s" % 86400
In many cases you can also use a cron job to regularly refresh your cache.
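For example, a handler that a cron entry could hit periodically to re-render an expensive page into memcache; the URL and render_front_page are placeholders:

import webapp2
from google.appengine.api import memcache

def render_front_page():
    # Placeholder for whatever expensive work builds the page
    return '<html>...</html>'

class RefreshFrontPage(webapp2.RequestHandler):
    # Point a cron.yaml entry at this URL, e.g. schedule: every 1 hours
    def get(self):
        memcache.set('front_page_html', render_front_page(), time=2 * 3600)
        self.response.write('cached')

app = webapp2.WSGIApplication([('/tasks/refresh_front_page', RefreshFrontPage)])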

I've written a simple library that reduces datastore operations by using local instance memory and memcache as storage layers in front of the datastore. It also supports cached GQL results. I managed to cut my apps' CPU usage by at least 50%. You can give it a try if you're not handling any sensitive data.
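The idea behind such a layered cache is roughly a multi-level lookup before touching the datastore; a simplified sketch of the concept (not the library's actual API, and MyModel is hypothetical):

from google.appengine.api import memcache
from google.appengine.ext import db

class MyModel(db.Model):  # hypothetical model
    value = db.StringProperty()

_local_cache = {}  # per-instance memory; lives only as long as the instance

def layered_get(key_name):
    entity = _local_cache.get(key_name)             # 1. instance memory (free)
    if entity is None:
        entity = memcache.get(key_name)             # 2. memcache (cheap)
    if entity is None:
        entity = MyModel.get_by_key_name(key_name)  # 3. datastore (expensive)
        if entity is not None:
            memcache.set(key_name, entity)
    if entity is not None:
        _local_cache[key_name] = entity
    return entity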

Related

Google appengine, least expensive way to run heavy datastore write cron job?

I have a Google appengine application, written in Go, that has a cron process which runs once a day at 3am. This process looks at all of the changes that have happened to my data during the day and stores some metadata about what happened. My users can run reports on this metadata to see trends that have happened over several months. The process does around 10-20 million datastore writes every night. It all works just fine, but since I started running it I have noticed a significant increase in my monthly bill from Google (from around $50/month to around $400/month).
I have just set up a very basic task queue that this runs in; I have not changed the default settings at all. Is there a better way I could be running this process at night that could save me money? I have never worked with backends (which are now deprecated) or the Modules API, and I know they've changed a lot of this stuff recently, so I'm not sure where to start looking. Any advice would be greatly appreciated.
Look at your instances at 3am. It might be that GAE spins up a lot of them to handle the job. You could configure your job to run less in parallel; it will take longer, but perhaps it will need only one instance.
However, if your datastore writes are indeed the biggest factor, this won't make a big impact.
You can also look at your data models and indexes. Remember that each indexed field costs 2 extra writes, so see if you can remove indexes from fields you don't need to query on.
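On the Python runtime, for illustration, ndb lets you opt individual properties out of indexing; the model and fields below are hypothetical:

from google.appengine.ext import ndb

class ChangeMeta(ndb.Model):  # hypothetical model for the nightly metadata
    user_id = ndb.StringProperty()               # indexed: needed for queries
    summary = ndb.StringProperty(indexed=False)  # unindexed: skips the extra index writes
    payload = ndb.TextProperty()                 # TextProperty is never indexed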
One improvement you can make is to batch your write operations; you can use memcache for this (pay for the dedicated version, since it's more reliable). Write the updates to memcache and, once the buffer reaches about 900KB (memcache values max out at 1MB), flush it to the datastore. This reduces the number of datastore writes a lot, especially if your metadata entries are small.
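The memcache-buffering part is application-specific, but when you do flush, batching the puts at least collapses the RPC overhead; a minimal sketch with ndb:

from google.appengine.ext import ndb

def write_in_batches(entities, batch_size=500):
    # One put_multi RPC per batch instead of one RPC per entity
    for i in range(0, len(entities), batch_size):
        ndb.put_multi(entities[i:i + batch_size])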

How to Gain Visibility and Optimize Quota Usage in Google App Engine?

How do I go about optimizing my Google App Engine app to reduce instance hours I am currently using/paying for?
I have been using App Engine for a while and the cost has been creeping upwards. I now spend enough on GAE that it's worth investing time in reducing the expense. More than half of my GAE bill is due to frontend instance hours, so it's the obvious place to start. But before I can start optimizing, I have to figure out what's using the instance hours.
However, I am having difficulty trying to determine what is currently using so many of my frontend instance hours. My app serves many ajax requests, dynamic HTML pages, cron jobs, and deferred tasks. For all I know there could be some runaway process that is causing my instance usage to be so high.
What methods or techniques are available to allow me to gain visibility into my app to see where I am using instance hours?
Besides code changes (all the suggestions in the other answer are good), you need to look at the instances-over-time graph.
If you have spikes on top of constant use, the instances created during the spikes won't go to sleep, because App Engine will keep using them. In the application settings on appspot.com, change the max idle instances to a low number like 1 (or your actual daily average).
Also, change the min pending latency to a higher number, so fewer instances will be created during spikes.
All these suggestions can have an immediate effect on lowering your bill, but they're just a complement to the code optimizations suggested in the other answer.
This is a very broad question, but I will offer a few pointers.
First, examine App Engine's console Dashboard and logs. See if there are any errors. Errors are expensive both in terms of lost business and in extra instance hours. For example, tasks are retried several times, and these retries may easily prolong the life of an instance beyond what is necessary.
Second, the Dashboard shows you a summary of your requests over a 24-hour period. Look for requests with high latency and see if you can improve them. This will both improve the user experience and may reduce the number of instance hours, as more requests can be handled by each instance.
Also look for data points that surprise you as the developer of your app. If you see a request that is called many more times than you think is normal, zero in on it and see what is happening.
Third, look at your queues' execution rates. When you add multiple tasks to a queue, do you really need all of them to be executed within seconds? If not, reduce the execution rate so that the queue never needs more than one instance.
Fourth, examine your cron jobs. If you can reduce their frequency, you can save a bunch of instance hours. If your cron jobs must run frequently and do a lot of computing, consider moving them to a Compute Engine instance. Compute Engine instances are many times cheaper, so having such an instance run for 24 hours may be a better option than hitting an App Engine instance every 15 minutes (or even every hour).
Fifth, make sure your app is thread-safe, and your App Engine configuration states so.
Finally, do the things that all web developers do (or should do) to improve their apps/websites. Cache what can be cached. Minify what needs to be minified. Put images in sprites. Split your code if it can be split. Use Memcache. Etc. All of these steps reduce latency and/or client-server roundtrips, which helps to reduce the number of instances needed for the same number of users.
OK, my other answer was about optimizing at the settings level.
To trace performance at a granular level, use the new Cloud Trace, released today at Google I/O 2014.
http://googledevelopers.blogspot.com/2014/06/cloud-platform-at-google-io-enabling.html

Identify why Google app engine is slow

I developed an application for a client that uses Play Framework 1.x and runs on GAE. The app works great, but it is sometimes extremely slow: it can take around 30 seconds to load a simple page, while at other times the same page is fast, with no code change whatsoever.
Is there any way to identify why it's running slow? I tried to contact support but couldn't find any telephone number or email address. There is also no response on the official Google group.
How would you approach this problem? My customer is very angry because of the slow loading times, but switching to another provider is a last resort at the moment.
Use GAE Appstats to profile your remote procedure calls. All of the RPCs are slow (Google Cloud Storage, Google Cloud SQL, ...), so if you can reduce the number of RPCs or introduce some caching data structures, your application will be much faster. Appstats also shows you which parts are slow and whether they need attention :)
For example, I created a Google Cloud Storage cache for my application and decreased execution time from 2 minutes to under 30 seconds. The RPCs are the bottleneck in GAE.
Google does not usually provide contact support for many of its services. The slowness described here is probably caused by a cold start: App Engine front-end instances go to sleep after about 15 minutes of inactivity. You could write a cron job that pings your instances every 14 minutes to keep them warm.
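The question is about a Java app, but on the Python runtime a minimal warm-up target might look like this (the URL is illustrative; the schedule itself goes in cron.yaml):

import webapp2

class KeepWarm(webapp2.RequestHandler):
    # Hit by a cron entry scheduled every 14 minutes; the trivial
    # response keeps at least one instance resident
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('ok')

app = webapp2.WSGIApplication([('/keepwarm', KeepWarm)])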
Combining some answers and adding a few things to check:
Debug using Appstats. Look for "staircase" situations and RPC calls. Maybe something in your app is triggering RPC calls at certain points that don't happen in your logic all the time.
Tweak your instance settings. Add some permanent/resident instances and see if that makes a difference. If you are spinning up new instances, things will be slow, probably for around the time frame (30 seconds or more) you describe, and it will seem random. It's not just how many instances, but what combination of the sliders you are using (you can actually hurt yourself with too few or too many).
Look at your app itself. Are you doing lots of memory allocations in the JVM? Allocating/freeing memory is inherently a slow operation and can cause freezes. Are you sure your freezing is not a JVM issue? Try replicating the problem locally, tweak the JVM -Xmx and -Xms settings, and see if you find similar behavior. Also profile your application locally for memory/performance issues. You can cut down on allocations using pooling, DI containers, etc.
Are you running any sort of cron jobs/processing on your front-end servers? Try to move as much as you can to background tasks, such as sending emails (see the sketch below). The intervals may seem random, but they can be the result of things happening according to your job settings; "9am every day" may not mean what you think, depending on the cron/task options. A corollary: move things to back-end servers and pull queues.
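On the Python runtime, for illustration, moving work like email sending off the request path is one deferred call (the functions here are hypothetical):

from google.appengine.ext import deferred

def send_welcome_email(user_id):
    # Slow work (template rendering, SMTP) now runs on the task queue
    pass

def handle_signup(user_id):
    # ...respond to the user quickly...
    deferred.defer(send_welcome_email, user_id)  # enqueue and return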
It's tough to give you a good answer without more information. The best someone here can do is give you a starting point, which pretty much every answer here already has.
By making at least one instance permanent (resident), you get a great improvement on first use. It takes about 15 seconds to load the application into a fresh instance, which is why you experience long request times when nobody has been using the application for a while.

Is app engine more expensive when it's slower?

There have been quite a few occasions recently when App Engine appears to run slower. To some degree that's understandable with the architecture of their cloud platform. I'm not talking about new server instances - just requests to warm servers. I'm also just referring to CPU, not the datastore API, but I do wonder about that as well.
It seems that during these slow periods I get a lot more yellow warnings on my requests, saying I am using a lot of CPU. Certainly they take longer to complete during these periods. What concerns me is that during these slow periods, my billable CPU seems to go up.
So to be clear: when App Engine is fast, a request might complete in 100ms. In a slow period, it might take more than 1s for the same request. Same URI, same caching, same processing path, same datastore, same indexes - much more CPU. The yellow warnings, as I understand it, refer to billable CPU usage, and there are many more of them when App Engine is slower.
This seems to set up a bizarre situation where my app costs more to run when App Engine's performance is worse. This means Google makes more money the more poorly the platform performs (up to the point where it fails or customers leave). Maybe I've got the situation all wrong and it doesn't work like that, but if it does, then as a customer the pressures and balances there are all wrong. That's not to imply any wrongdoing on Google's part, just that the relationship between those two things doesn't seem right.
It almost seems like Google's algorithm goes something like: 'If I give a processing job to a CPU, start my watch, and stop it when the job returns, I get the billable CPU figure.' That is, it doesn't measure CPU work at all. Surely that time should be divided by the number of processing jobs being executed concurrently, plus some extra to cover the additional context switching. I'm sure that stuff is hard to measure - perhaps that's the reason.
I guess you could argue it is fair that you pay more when App Engine is in high demand, but that makes budgeting close to impossible: you can't generate stats like '100 users cost me $1 a day', because that could change for a whole host of reasons, including App Engine onboarding more customers than the infrastructure can realistically handle. If Google over-subscribes App Engine, then all customers pay more - another relationship that doesn't sound right. Surely Google's costs should go down as they onboard more customers and those customers use more resources, based on economies of scale.
Should I expect two identical requests in my app to cost me roughly the same amount each time they run, regardless of how much wall time App Engine takes to actually complete them? Have I misunderstood how this works? If I haven't, is there a reason why I shouldn't be worried about it in the long term? Is there any documentation that makes this situation clearer? Cheers,
Colin
It would be more complicated, but they could change the billing algorithm to be a function of load. Or perhaps they could normalize the CPU measurements based on the performance of similar calls in the past.
I agree that this presents problems for the developers.
Yes, this is true. It is a bummer. It also takes them over a second to start up my Java application (which I was billed for) every time they decided my site was in low demand and didn't need the resources.
I ended up using a cron job to auto-ping my site every minute to keep it warm. Doing all that wasted work actually made my bill cheaper: it avoided the startup time, and instead there were just lots of 2ms pings...
This question appears old, and I think the pricing scheme must have changed...
Google App Engine charges for "instance hours", and the instances currently spawned are viewable in the GAE console. Google also provides performance adjustments so you can trade off cost vs. latency for your app.
https://developers.google.com/appengine/docs/adminconsole/performancesettings
I did notice that if the front-end is bogged down hitting a common backend resource, GAE will spawn a bunch of instances to try to get latency down, and you will pay for those instance hours even though latency/throughput doesn't improve. The adjustments I mentioned seem to help with that.

How to gear towards scalability for a start up e-commerce portal?

I want to scale an e-commerce portal based on LAMP. Recently we've seen a huge traffic surge.
What would be the steps (please mention them in order) to scale it:
Should I consider moving onto Amazon EC2 or similar? What could be the potential problems in switching servers?
Do we need to redesign the database? I read that Facebook switched from MySQL to Cassandra. What kind of code changes are required if we switch to Cassandra? Would Cassandra be a better option than MySQL?
Possibility of Hadoop? Not even sure.
Any other things that need to be thought of?
Found this post helpful. This blog has nice articles as well. What I want to know is the list of steps I should consider in scaling this app.
First, I would suggest making sure every resource served by your server sets appropriate cache control headers. The goal is to make sure truly dynamic content gets served fresh every time and any stable or static content gets served from somebody else's cache as much as possible. Why deliver a product image to every AOL customer when you can deliver it to the first and let AOL deliver it to all the others?
If you currently run your webserver and dbms on the same box, you can look into moving the dbms onto a dedicated database server.
Once you have done the above, you need to start measuring the specifics. What resource will hit its capacity first?
For example, if the webserver is running at or near capacity while the database server sits mostly idle, it makes no sense to switch databases or to implement replication etc.
If the webserver sits mostly idle while the dbms chugs away constantly, it makes no sense to look into switching to a cluster of load-balanced webservers.
Take care of the simple things first.
If the dbms is the likely bottle-neck, make sure your database has the right indexes so that it gets fast access times during lookup and doesn't waste unnecessary time during updates. Make sure the dbms logs to a different physical medium from the tables themselves. Make sure the application isn't issuing any wasteful queries etc. Make sure you do not run any expensive analytical queries against your transactional database.
If the webserver is the likely bottle-neck, profile it to see where it spends most of its time and reduce the work by changing your application or implementing new caching strategies etc. Make sure you are not doing anything that will prevent you from moving from a single server to multiple servers with a load balancer.
If you have taken care of the above, you will be much better prepared for making the move to multiple webservers or database servers. You will be much better informed for deciding whether to scale your database with replication or to switch to a completely different data model etc.
1) First, measure how many requests per second your most-visited pages can serve. For well-written PHP sites on average hardware, this should be in the 200-400 requests per second range. If you are not there, you have to optimize the code by reducing the number of database requests, caching rarely-changed data in memcached/shared memory, and using a PHP accelerator. If you are at some 10-20 requests per second, you need to get rid of your bulky framework.
2) Second, if you are still on Apache2, you have to switch to lighttpd or nginx+apache2. Personally, I like the second option.
3) Then move all your static data to a separate server or a CDN. Make sure it is served with "Expires" headers of at least 24 hours.
4) Only after all these things should you start thinking about going to EC2/Hadoop, building multiple servers, and balancing the load (nginx would also help you there).
After steps 1-3 you should be able to serve some 10,000,000 hits per day easily.
If you need just 1.5-3 times more, I would go for a single more powerful server (8-16 cores, lots of RAM for caching and the database).
With step 4 and multiple servers you are on your way to 0.1-1 billion hits per day (but with significantly larger hardware and support expenses).
Find out where issues are happening (or are likely to happen if you don't have them now). Knowing what your biggest resource usage is matters when evaluating any solution. Stick to solutions that will give you the biggest improvement.
Consider:
- Higher-than-needed bandwidth use per user is something you want to address regardless of moving to EC2. It will cost you money either way, so it's worth taking a look at things like this: http://developer.yahoo.com/yslow/
- Don't invest in changing databases if that's a non-issue. Find out first whether that's really the problem; even if you are having issues with the database, it might be a code issue, i.e. hitting the database many times per request.
- Unless we are talking about very big numbers, you shouldn't have high CPU usage issues. If you do, find out where they are happening; optimization is worth it where specific code has a high impact on your overall resource usage.
- After making sure the above is reasonable, you may get big improvements with caching: in bandwidth (making sure browsers/proxies can play their part in caching) and in local resource usage (avoiding re-processing/re-retrieving the same info all the time).
I'm not saying you should go all out with the above, just far enough to make sure you won't hit the same issues elsewhere in a few months, and far enough to find out where your biggest gains are and whether you will get enough value from any scaling options. This will also let you come back and ask questions about specific problems, and about how these scaling options relate to them.
You should prepare by choosing a flexible framework and expect things to change along the way; in some situations it's difficult to predict your users' behavior.
If you have seen an explosion of traffic recently, analyze which pages are the slowest.
You can move to the cloud, but EC2 is not the best-performing option. Again, make sure there's no other optimization you can do first.
The database might need a redesign, but I doubt all of it does. Again, look at the problem points.
Both Hadoop and Cassandra are pretty nifty, but they might be overkill.
