I have an app that spikes PostgreSQL queries in the evening (over 80% CPU usage) and runs very few the rest of the day (under 20% usage). I want this machine to autoscale when it is querying PostgreSQL heavily, instead of worrying that we will hit CPU limits. I have two questions.
When we hit 100%, do requests get delayed?
Is there an autoscaling PostgreSQL? It seems there's one on AWS but I don't know how to move from GCP to AWS easily without downtime...
1- Yes. When CPU usage hits 100%, requests will be delayed.
2- Autoscaling for storage is enabled by default, but there is no autoscaling feature for cores and memory in Cloud SQL.
Compute Engine autoscaling works by adding more VMs to your managed instance group when there is more load and deleting VMs when they are no longer needed, but this does not apply to Cloud SQL instances.
You can set up the alert policy in Cloud Monitoring to get notified when an instance is reaching peak usage and then make changes manually.
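For illustration, here is a minimal sketch of such an alert policy using the google-cloud-monitoring Python client. The project ID is a placeholder and the exact enum paths vary between client-library versions, so treat this as an assumption to verify against the current docs:

from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

client = monitoring_v3.AlertPolicyServiceClient()
project_name = "projects/my-project"  # placeholder project ID

# Alert when Cloud SQL CPU utilization stays above 80% for 5 minutes.
policy = monitoring_v3.AlertPolicy(
    display_name="Cloud SQL CPU > 80%",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="High CPU utilization",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter='metric.type="cloudsql.googleapis.com/database/cpu/utilization" '
                       'AND resource.type="cloudsql_database"',
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=0.8,
                duration=duration_pb2.Duration(seconds=300),
            ),
        )
    ],
)
client.create_alert_policy(name=project_name, alert_policy=policy)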
You can also create a feature request and present your use case here so that Engineering can potentially look into implementing it.
Related
I was searching for a solution which would allow me to run tasks up to 24h long. The combination of Cloud Tasks and multiple AppEngine Backend instances seemed like a perfect way to go.
As the tasks are long-running, I would like to scale to max_instances as fast as possible, but I am having trouble doing so.
Here is my app.yaml
service: slow
runtime: python37
# gunicorn --timeout=90000 (25 h) -> the App Engine backend instance should raise DeadlineExceededError after 24 h
entrypoint: gunicorn main:app --workers 1 --timeout=90000
instance_class: B2
basic_scaling:
  max_instances: 15
Here is a screenshot of my Cloud Tasks queue configuration.
My issue is that the tasks from the Cloud Tasks queue are not spawning new instances as I would expect (e.g. 15 max_concurrent_tasks in the queue settings should spawn 15 backend instances).
I somehow managed to work around this by aggressively increasing max_concurrent_tasks in the queue configuration (200 max_concurrent_tasks will spawn 15 backend instances).
Unfortunately, as the number of tasks in the queue decreases, the backend instances start terminating.
Now there are 8 tasks left in the queue (out of several hundred) and only 1 backend instance, which is running 1 task. I cannot trigger additional instances even by clicking the "RUN TASK" button in the Cloud Tasks web UI.
Has anyone come across a similar issue?
Do you have any hint why this might be happening?
Why doesn't Cloud Tasks hit the /_ah/start endpoint to spin up a new instance to run on?
I am not sure basic scaling is a good idea here. According to the documentation:
If you use basic scaling, App Engine attempts to keep your cost low, even though that may result in higher latency as the volume of incoming requests increases.
It seems that automatic scaling would be a better idea. In the same document you can find:
If you use automatic scaling, each instance in your app has its own queue for incoming requests. Before the queues become long enough to have a noticeable effect on your app's latency, App Engine automatically creates one or more new instances to handle the increasing load.
You can configure the settings for automatic scaling to achieve a trade-off between the performance you want and the cost you can incur.
The documentation mentions 3 settings that can be used:
Target CPU Utilization
Target Throughput Utilization
Max Concurrent Requests
I think you should be able to find the configuration that will serve you best with those three.
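For illustration, a minimal app.yaml sketch combining those three settings (the values are placeholders to tune, not recommendations):

automatic_scaling:
  target_cpu_utilization: 0.65        # CPU level at which new instances spawn
  target_throughput_utilization: 0.6  # fraction of max_concurrent_requests before scaling out
  max_concurrent_requests: 10         # requests one instance accepts at once
  max_instances: 15

Note that automatic scaling runs on F-class instances rather than the B2 class in your current config, so instance_class would need to change as well.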
We have a production web-application deployment on GAE (LAMP stack) with the autoscaled setting, and according to the documentation Google will automatically spin up additional instances to meet demand. This seemed to be proven when we went live hours before a season finale aired that was guaranteed to drive traffic to our site, and the site did NOT fall over even with the expected sizable influx - so kudos to Google! However, I'd be naive to think that this server architecture is done, knowing that we're still in our infancy and could potentially get 10-100x more traffic on a consistent basis in the near future as we gain popularity and move into the global market. So my question is:
Should I be implementing a Load Balancer in GCP or will GAE be able to scale "indefinitely" to accommodate?
Based on this answer: AppEngine load balancing across multiple regions, you'll need to implement a load balancer if you're targeting multiple regions.
Otherwise it will depend on your configuration and the thresholds you've set in your GAE config.
According to https://cloud.google.com/appengine/docs/standard/go/how-instances-are-managed, there are three ways you can define scaling on your AppEngine instance:
Automatic scaling
Automatic scaling creates dynamic instances based on request rate, response latencies, and other application metrics. However, if you specify a number of minimum idle instances, that specified number of instances run as resident instances while any additional instances are dynamic.
Basic scaling
Basic scaling creates dynamic instances when your application receives requests. Each instance will be shut down when the app becomes idle. Basic scaling is ideal for work that is intermittent or driven by user activity.
Manual scaling
Manual scaling uses resident instances that continuously run the specified number of instances regardless of the load level. This allows tasks such as complex initializations and applications that rely on the state of the memory over time.
So the answer is: it depends. You'll need to base your scaling strategy on how your load distribution looks. I would expect automatic scaling to be fine for 90% of early-stage websites, though that's just my impression.
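For example, if you expect sharp spikes like your season-finale launch, here is a hedged app.yaml sketch showing how resident (idle) instances can be kept warm ahead of a burst (the values are illustrative only):

automatic_scaling:
  min_idle_instances: 5       # resident instances kept warm to absorb sudden bursts
  max_instances: 100          # hard cap to bound cost
  target_cpu_utilization: 0.6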
Is there a limit on the rate of inferences one can make for a SageMaker endpoint?
Is it determined somehow by the instance type behind the endpoint or the number of instances?
I tried looking for this info as AWS Service Quotas for SageMaker but couldn't find it.
I am invoking the endpoint from a Spark job and wondered whether the number of concurrent tasks is a factor I should take care of when running inference (assuming each task runs one inference at a time).
Here's the throttling error I got:
com.amazonaws.services.sagemakerruntime.model.AmazonSageMakerRuntimeException: null (Service: AmazonSageMakerRuntime; Status Code: 400; Error Code: ThrottlingException; Request ID: b515121b-f3d5-4057-a8a4-6716f0708980)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.doInvoke(AmazonSageMakerRuntimeClient.java:236)
at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.invoke(AmazonSageMakerRuntimeClient.java:212)
at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.executeInvokeEndpoint(AmazonSageMakerRuntimeClient.java:176)
at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.invokeEndpoint(AmazonSageMakerRuntimeClient.java:151)
at lineefd06a2d143b4016906a6138a6ffec15194.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$a5cddfc4633c5dd8aa603ddc4f9aad5$$$$w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Predictor.predict(command-2334973:41)
at lineefd06a2d143b4016906a6138a6ffec15200.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$50a9225beeac265557e61f69d69d7d$$$$w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(command-2307906:11)
at lineefd06a2d143b4016906a6138a6ffec15200.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$50a9225beeac265557e61f69d69d7d$$$$w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(command-2307906:11)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:2000)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1220)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1220)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2321)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2321)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:140)
at org.apache.spark.scheduler.Task.run(Task.scala:113)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:533)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:539)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Amazon SageMaker offers a model hosting service (https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html), which gives you a lot of flexibility based on your inference requirements.
As you noted, first you can choose the instance type to use for your model hosting. The large set of options makes it possible to tune hosting to your models. You can host the model on GPU-based machines (P2/P3/P4) or CPU-only ones, with faster CPUs (C4, for example) or more RAM (R4, for example), and with more cores (16xl, for example) or fewer (medium, for example). Here is the full range of instances you can choose from: https://aws.amazon.com/sagemaker/pricing/instance-types/. It is important to balance performance and cost: the instance type, together with the type and size of your model, determines the invocations-per-second you can expect in this single-node configuration, and you should measure this number to avoid hitting the throttling errors that you saw.
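In the meantime, wrapping the client call in a retry with exponential backoff keeps a transient throttle from failing an entire Spark task. Your trace is from the Java SDK, but the pattern is the same; a minimal sketch with boto3, where the endpoint name, content type, and retry bounds are placeholder assumptions:

import time
import boto3
from botocore.exceptions import ClientError

runtime = boto3.client("sagemaker-runtime")

def invoke_with_backoff(endpoint_name, payload, max_retries=5):
    # Retry InvokeEndpoint on ThrottlingException with exponential backoff.
    for attempt in range(max_retries):
        try:
            response = runtime.invoke_endpoint(
                EndpointName=endpoint_name,
                ContentType="text/csv",  # adjust to your model's input format
                Body=payload,
            )
            return response["Body"].read()
        except ClientError as e:
            if e.response["Error"]["Code"] != "ThrottlingException":
                raise
            time.sleep((2 ** attempt) * 0.1)  # 0.1s, 0.2s, 0.4s, ...
    raise RuntimeError("still throttled after %d retries" % max_retries)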
The second important feature of SageMaker hosting is the ability to auto-scale your model across multiple instances. You can configure the endpoint to automatically add and remove instances based on the load on it. AWS places a load balancer in front of the instances hosting your model and distributes requests among them. Autoscaling lets you keep a smaller instance count during low-traffic hours, scale up during peak hours, and still keep your costs low and your throttling errors to a minimum. See the SageMaker autoscaling documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html
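Endpoint autoscaling is configured through the Application Auto Scaling service. A hedged boto3 sketch, where the endpoint and variant names are placeholders and the invocations target should come from your own load tests:

import boto3

client = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"  # placeholder names

# Register the endpoint variant's instance count as a scalable target.
client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale on invocations per minute per instance, a predefined SageMaker metric.
client.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # invocations per minute per instance; tune from load tests
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)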
App Engine has been great for requests that process quickly with no external API calls to databases, caches, or third-party resources. But we've found that introducing any sort of "longer-running" component or external latency makes the "instance hours" compound and drives costs up considerably. For example, an HTTP POST operation that runs asynchronously in the background and might take a second or two to process a few more intense database queries is totally invisible and fine from a UX perspective on the client side, because it's asynchronous, but it's expensive in App Engine billing since it's long-running.
These sorts of expense-inducing situations, where a request is literally just waiting for a response from an external resource and requires almost zero CPU while idling, seem avoidable, but I'm not sure they are avoidable on App Engine.
It's almost like a "long poll" where the response might be left open but doing nothing.
Is there a way to do this on App Engine without paying an insane amount for instance hours, or would we be better off moving to Compute Engine or EC2? Does it scale automatically based on CPU load, or solely on the total count of open (and perhaps inactive) requests? Threadsafe is indeed enabled.
There are really two ways to go about this one (top of mind).
Use Task Queues!
If the work doesn't need to happen at exactly the same time as the request, this is exactly what Task Queues in App Engine are for. They allow you to put a job on a queue and have another module pick up the work. They're great because you can scale your front-end and back-end processes separately; a minimal sketch follows.
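A hedged sketch using the first-generation Python taskqueue API (the '/work' URL and the 'backend' module name are hypothetical):

from google.appengine.api import taskqueue

def enqueue_slow_job(payload):
    # Hand the slow part to a worker module that scales independently
    # of the user-facing front end.
    taskqueue.add(
        url='/work',              # handler on the worker module
        params={'data': payload},
        target='backend',         # route the task to that module
    )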
If that doesn't work....
Use App Engine Flexible
Under the hood, App Engine Flexible just runs GCE instances. The cost structure is entirely different, since you persistently have a VM running in the background to serve your requests.
Hope this helps!
What you're really worried about here is how App Engine scales your instances. Because many of your requests require few resources, your app might be able to handle many more concurrent requests on a single instance than normal. You can look into the parameters that shape scaling here. Of particular interest:
max_concurrent_requests: the number of concurrent requests an automatic-scaling instance can accept before the scheduler spawns a new instance (default: 8, maximum: 80).
There is a danger here: an instance may fill up with non-long-polling requests and become overburdened. To prevent that, you could isolate your long-polling requests into their own service and set its scaling parameters separately from the rest of your app.
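For example, a hedged app.yaml sketch for such a hypothetical long-polling service (the service name and values are placeholders; since these requests mostly idle, each instance can hold many of them):

service: longpoll
runtime: python37
automatic_scaling:
  max_concurrent_requests: 80         # near the maximum, since these requests mostly wait
  target_throughput_utilization: 0.9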
I'm currently looking for a cloud PaaS that will allow me to scale an application to handle anything between 1 user and 10 million+ users. I've never worked on anything this big, and the big question I can't seem to get a clear answer to is: if you develop, let's say, a standard application with a relational database and SOAP web services, will this application scale automatically when deployed on a PaaS solution, or do you still need to build the application with failover, redundancy, and all those things in mind?
Let's say I deploy a Spring/Hibernate application to Amazon EC2 and create a single Ubuntu Server instance with Tomcat installed. Will this application just scale indefinitely, or do I need more Ubuntu instances? If more than one Ubuntu instance is needed, does Amazon take care of running the application across both instances, or is that the developer's responsibility? What about database storage: can I install a database on EC2 that will scale as the database grows, or do I need to use one of their APIs instead if I want it to scale indefinitely?
CloudFoundry allows you to build locally and deploy straight to their PaaS, but since it's in beta there's a limit on the amount of resources you can use, and databases are limited to 128 MB if I remember correctly, so this is a no-go for now. Some have suggested installing CloudFoundry on Amazon EC2; how does it scale, and how is the database layer handled then?
GAE (Google App Engine): will this allow me to just deploy an app and not have to worry about how it scales and implements redundancy? There appear to be some limitations on what you can and can't run on GAE, and their recent price increase upset quite a large number of developers. Is it really that expensive compared to other providers?
So basically, will it scale and what needs to be done to make it scale?
That's a lot of questions for one post. Anyway:
Amazon EC2 does not scale automatically with load. EC2 is basically just a virtual machine. You can achieve scaling of EC2 instances with Auto Scaling and Elastic Load Balancing.
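A hedged sketch of that Auto Scaling plus Elastic Load Balancing combination using today's boto3 (all names, ARNs, and subnets are placeholders, and it assumes a launch template and a load-balancer target group already exist):

import boto3

autoscaling = boto3.client("autoscaling")

# Create an Auto Scaling group behind an existing load-balancer target group.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
    MinSize=2,
    MaxSize=10,
    TargetGroupARNs=[
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web/abc123"  # placeholder
    ],
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",  # placeholder subnets
)

# Track average CPU across the group, adding/removing instances to hold 50%.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)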
SQL databases scale poorly. That's why people started using NoSQL databases in the first place. It's best to see which database your cloud provider offers as a managed service: the Datastore on GAE and DynamoDB on Amazon.
Installing your own database on EC2 instances is very impractical, as EC2 instance storage is ephemeral (it loses all data on "disk" when the instance is stopped).
The GAE Datastore is actually one big database for all applications running on it, so it's pretty scalable: your millions of users should not be a problem for it.
http://highscalability.com/blog/2011/1/11/google-megastore-3-billion-writes-and-20-billion-read-transa.html
Yes, App Engine scales automatically, both frontend instances and the database. There is nothing special you need to do to make it scale; just use their APIs.
There are limitations on what you can do with AppEngine:
A. No local storage (filesystem) - you need to use Datastore or Blobstore.
B. Comet is only supported via their proprietary Channel API.
C. Datastore is a NoSQL database: no JOINs, limited queries, limited transactions.
The cost of GAE is not bad. We do 1M requests a day for about 5 dollars a day. The biggest saving comes from the fact that you do not need a system admin on GAE (but you do need one for EC2). Compared to the cost of manpower, GAE is incredibly cheap.
Some hints to save money on (and speed up) GAE:
A. Use get instead of query in the Datastore (this requires carefully crafted natural keys).
B. Use memcache to cache data you got from the datastore. This can be done automatically with Objectify and its @Cached annotation. (Hints A and B are sketched in code after this list.)
C. Denormalize data, meaning you write data redundantly in various places in order to read it back in as few operations as possible.
D. If you have a lot of REST requests from devices where you do not use cookies, switch off session support (or roll your own, as we did). Sessions use the datastore under the hood and do a get and a put for every request.
E. Read about adjusting app settings. Try different settings (depending on how tolerant your app is to request delay and on your traffic patterns/spikes). We were able to cut our frontend instances by 70%.
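As referenced in hints A and B, here is the same idea (get-by-natural-key plus a memcache layer) sketched in first-generation Python App Engine terms rather than Objectify. The Account model and key scheme are hypothetical, and ndb's built-in caching can do much of this for you automatically:

from google.appengine.api import memcache
from google.appengine.ext import ndb

class Account(ndb.Model):  # hypothetical model
    balance = ndb.IntegerProperty()

def get_account(user_id):
    cache_key = 'account:%s' % user_id
    account = memcache.get(cache_key)
    if account is None:
        # user_id doubles as the entity's natural key, so this is
        # a cheap get by key, not a query.
        account = ndb.Key(Account, user_id).get()
        memcache.set(cache_key, account, time=300)  # cache for 5 minutes
    return account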