GAE, am I still required to implement a load balancer? - google-app-engine

Have a production web-application deployment on GAE (LAMP Stack) with autoscaled setting, and according to the documentation, Google will automatically spin-up additional instances to meet demand; this seemed to have been proven when we went live, hours before a season finale aired which would guarantee traffic hit our site, and our site did NOT fall-over even with the expected sizable influx - so kudos to Google! However, I'd be naive to think that this server architecture is done, knowing that we're still in our infancy, and we could potentially get 10 - 100x more traffic in the near future on a consistent basis when we gain popularity and move into the global market. So my question is:
Should I be implementing a Load Balancer in GCP or will GAE be able to scale "indefinitely" to accommodate?

Based on this answer: AppEngine load balancing across multiple regions you'll need to implement the load balancer if you're targeting multiple regions.
Otherwise it will be dependent on your configuration and the thresholds you've set on your GAE config.
According to https://cloud.google.com/appengine/docs/standard/go/how-instances-are-managed, there are three ways you can define scaling on your AppEngine instance:
Automatic scaling
Automatic scaling creates dynamic instances based on request rate, response latencies, and other application metrics. However, if you
specify a number of minimum idle instances, that specified number of
instances run as resident instances while any additional instances
are dynamic.
Basic Scaling
Basic scaling creates dynamic instances when your application receives requests. Each instance will be shut down when the app
becomes idle. Basic scaling is ideal for work that is intermittent or
driven by user activity.
Manual scaling
Manual scaling uses resident instances that continuously run the specified number of instances regardless of the load level. This
allows tasks such as complex initializations and applications that
rely on the state of the memory over time.
So the answer is it depends. You'll just need to base your scaling strategy on how your load distribution looks. I would expect that the automatic scaling is fine for 90% of early-stage websites, though that's just my impression.

Related

Limit on the rate of inferences one can make for a SageMaker endpoint

Is there a limit on the rate of inferences one can make for a SageMaker endpoint?
Is it determined somehow by the instance type behind the endpoint or the number of instances?
I tried looking for this info as AWS Service Quotas for SageMaker but couldn't find it.
I am invoking the endpoint from a Spark job abd wondered if the number of concurrent tasks is a factor I should be taking care of when running inference (assuming each task runs one inference at a time)
Here's the throttling error I got:
com.amazonaws.services.sagemakerruntime.model.AmazonSageMakerRuntimeException: null (Service: AmazonSageMakerRuntime; Status Code: 400; Error Code: ThrottlingException; Request ID: b515121b-f3d5-4057-a8a4-6716f0708980)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.doInvoke(AmazonSageMakerRuntimeClient.java:236)
at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.invoke(AmazonSageMakerRuntimeClient.java:212)
at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.executeInvokeEndpoint(AmazonSageMakerRuntimeClient.java:176)
at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.invokeEndpoint(AmazonSageMakerRuntimeClient.java:151)
at lineefd06a2d143b4016906a6138a6ffec15194.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$a5cddfc4633c5dd8aa603ddc4f9aad5$$$$w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Predictor.predict(command-2334973:41)
at lineefd06a2d143b4016906a6138a6ffec15200.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$50a9225beeac265557e61f69d69d7d$$$$w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(command-2307906:11)
at lineefd06a2d143b4016906a6138a6ffec15200.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$50a9225beeac265557e61f69d69d7d$$$$w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(command-2307906:11)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:2000)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1220)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1220)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2321)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2321)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:140)
at org.apache.spark.scheduler.Task.run(Task.scala:113)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:533)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:539)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Amazon SageMaker is offering model hosting service (https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html), which gives you a lot of flexibility based on your inference requirements.
As you noted, first you can choose the instance type to use for your model hosting. The large set of options is important to tune to your models. You can host the model on a GPU based machines (P2/P3/P4) or CPU ones. You can have instances with faster CPU (C4, for example), or more RAM (R4, for example). You can also choose instances with more cores (16xl, for example) or less (medium, for example). Here is a list of the full range of instances that you can choose: https://aws.amazon.com/sagemaker/pricing/instance-types/ . It is important to balance your performance and costs. The selection of the instance type and the type and size of your model will determine the invocations-per-second that you can expect from your model in this single-node configuration. It is important to measure this number to avoid hitting the throttle errors that you saw.
The second important feature of the SageMaker hosting that you use is the ability to auto-scale your model to multiple instances. You can configure the endpoint of your model hosting to automatically add and remove instances based on the load on the endpoint. AWS is adding a load balancer in front of the multiple instances that are hosting your models and distributing the requests among them. Using the autoscaling functionality allows you to keep a smaller instance for low traffic hours, and to be able to scale up during peak traffic hours, and still keep your costs low and your throttle errors to the minimum. See here for documentation on the SageMaker autoscaling options: https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html

Understanding Cost Estimate for Google Cloud Platform MicroServices Architecture Design

I'm redesigning a monolith application into a MicroServices architecture and am hoping to use Google Cloud Platform (GCP) to host the entire solution. I'm having a very hard time understanding their costing breakdown, and am concerned that my costs will be uncontrollable after I build it. This is for a personal project but I'm hoping will have many users after I launch so I want to get the underlying architecture right and at the same time have reasonable costs initially when I launch.
Here is my architecture:
MicroServices 1 - 4 (Total 4 API Services):
Runs on App Engine
Exposes a REST API and saves data to DataStore
Initially each API should get hit around 200 times a day
MicroService 5 (Events triggered API Service):
Runs on App Engine
Listens for PubSub events and saves to DataStore (basically I have a sensor that pushes data to this Service for storage)
Initially the PubSub should receive events around 200 times a day
MicroService 6-7 (Total 2 UI Services):
Runs on App Engine
These are UIs so people can login and use the systems. The UIs are lightweight frond end apps that use the REST Services above to populate user data in a nice way.
Each UI Service should be used around 3 hours a day
So in Total I have 7 MicroServices with each running as AppEngine "Services" in a single GCP "Project". A DataStore is shared between these APIs within this Project.
As I have 7 App Engine instances running, and they only need to be operational for a short period of time per day, how does the pricing work?
I want to use App Engine because it's completely Managed, which is one of my design requirements. But I'm hoping AppEngine has some kind of Sleep Mode, so that when there is no usage it does not bill?
Any help in understanding what my monthly costs would be would be appreciated.
Thanks very much.
Update 8/2/2017
I've decided to stay out of GCP for now. As I hope to have 7 App Engines Services running in Flex (as they are node.js) I don't seem to get access to a free tier or the ability to scale idle services to 0 instances.
This means I'll be paying full price for these services. (i.e. 7 X Full App Engine VM Cost per Monthly :O )
This is an expense I cant have just for a POC of a proper MicroService design. Instead I'm going to continue with my MicroService design but use a 10$ DigitalOcean box and Dokku to containerise my Services. If this works well and I have a need I will migrate this design to GCP (or AWS)
The full outline of App Engine instance handling is available at https://cloud.google.com/appengine/docs/python/how-instances-are-managed .
In short, your best bet is to enable automatic scaling and set
max_idle_instances = 0
in your app.yaml.
That means that your app will autoscale to handle traffic as needed and shut down the instances afterwards. Also
When settling back to normal levels after a load spike, the number of idle instances can temporarily exceed your specified maximum. However, you will not be charged for more instances than the maximum number you've specified.
Later - when load time becomes more important you can set min_idle_instances to a more suitable number - this allows for responsive apps.
am concerned that my costs will be uncontrollable after I build it
You should be aware that automatically scalable GAE apps always have cost components dependent on the external user request patterns which are not controllable.
For example, in the standard GAE env, the way those 200 requests/day are distributed matters significantly:
if they are evenly distributed they will come in less than 15 min apart - the minimum billed time per instance lifetime, so the respective service will be billed for minimum 24 instance hours per day (very close to the daily 28 free instance-hours/day for billed apps, only a single-service app using the smallest instance class can fit in it).
if they are all received within a 15 minutes interval the service will be billed for 0.5 instance hours daily (which can easily fit in the free daily quota even with multiple services and/or with more powerfull instance classes).
The actual scalability configuration of each service can matter as well. See, for example,
The only way to keep costs under strict control is via the daily budget configuration (but hitting that limit means your app's functionality will be temporarily crippled).
All other usage-based costs being equal due to the functionality being performed you have some (potentially significant) control over costs via:
the GAE environment type selected for each service:
the standard env is billed by instance hours and includes a free daily quota
the flex env has no free daily quota.
the number of services: you could start with fewer services by combining their functionalities (you can still keep them modularized for later split). The expected initial load you describe can easily fit within the free daily budget with just a single standard env service.
Once the app usage picks up and the free daily quotas percentage in the total costs become neglijible you can gradually split the app into multiple services as needed. In general this can be a relatively simple task if the app is properly modularized.

Google App Engine for long running but low CPU tasks, or long-polling?

App Engine has been great for requests that process quickly with no external API calls to databases or caches or third-party resources, but we've found that introducing any sort of "longer running" component or external latency (for example in a HTTP POST operation that runs asynchronously in the background and might take a second or two to process a few more intense database queries... totally invisible and OK from a UX perspective on the client-side because it's asynchronous but expensive to App Engine billing since it's long running) ... the "instance hours" compound and drive costs up considerably.
These sorts of expense inducing situations where a request is literally just waiting for a response from an external resource and requiring almost zero CPU during their idling seem avoidable, but I'm not sure if it's avoidable with App Engine.
It's almost like a "long poll" where the response might be left open but doing nothing.
Is there a way to do this on App Engine without just paying an insane amount for instance hours, or would we be better off moving to Compute Engine or EC2? Does it scale automatically based on CPU load, or is it based solely on open and perhaps inactive requests in total count? — threadsafe is indeed enabled.
There are really two ways to go about this one (top of mind).
Use Task Queues!
If the work doesn't need to be exactly at the same time of the request, this is exactly what [task queues] in App Engine are for. They allow you to put a job on a queue, and have another module pick up the work. They're kind of great because you can separately scale your front end and back end processes.
If that doesn't work....
Use App Engine Flexible
Under the hood App Engine Flexible is just running GCE instances. The cost structure is entirely different, since you persistently have a VM running in the background serving your requests.
Hope this helps!
What you're really worried about here is how App Engine scales your instances. Because many of your requests require few resources, your app might be able to handle many more concurrent requests on a single instance than normal. You can look into parameters that shape scaling here. Of particular interest:
max_concurrent_requests The number of concurrent requests an automatic scaling instance can accept before the scheduler spawns a new instance (Default: 8, Maximum: 80).
There is a danger here, where an instance may fill up with non-long-polling requests and become overburdened. To prevent that, you could isolate your long-polling requests into their own service and set its scaling parameters separately from the rest of your app.

How to configure App Engine for minimal cost?

I'm doing a prototype backend and in the near future I expect little traffic but while testing I consumed all my 300$ free trail.
How can I configure my app to consume the least possible resources? I need things like limiting the number of instances to 1, using a cheap machine, sleep whenever possible, I've read something about Client vs Backend intances.
With time I'll learn the config that best suits me, but now I need the CHEAPEST config to get going.
BTW: I am using managed-vms with Dart.
EDIT
I've been recommended to configure my app.yaml file, what options would you recommend to confront this issue?
There are two train of thought for your issue.
1) Optimization of code: This is very difficult for us as we are not privy to your App's usage and client-base and architecture. In general, it depends on what Google App Engine product you use the most, for example: Datastore API call (fetch, write, delete... etc...), BigQuery and Cloud SQL. Even after optimization, you can still incur a lot of cost depending on traffic.
2) Enforcing cheap operation: This is easier and I think this is what you want. You can manually enforce a daily budget (in your billing setup page) so the App never cost more than a certain amount per day. You can also artificially lower the maximum amount of idling instances to 0 and use the smallest instance possible (F1 for frontend).
For pricing details see this article - https://cloud.google.com/appengine/pricing#Billable_Resource_Unit_Costs
If you use managed VM -- you'll be billed for Compute Engine Instance prices, not for App Engine Instances, and, as I know, the minimum possible instance to use as Managed VM is "g1-small" which costs you $0.023 per hour full sustained usage (if it will be turned on all month), so you minimum bill will be 0.023 * 24 * 30 = $16.56 only for instance hours. Excluding disk and traffic. With minimum amount of datastore operations you may stay on free quota.
Every application consumes resources differently. To minimize your cost, you need to know what resources used the majority of your expenses and go from there.
If it is spent on extra instances that were just sitting there - then trim the number of instances to the minimum required and use a lower class instance. If you are seeing a lot of expense on datastore calls - then look at optimizing your entities and take advantage of memcache.
Lowest Cost for a simple app:
Use App Engine Standard. It scales to zero instances, so will not cost anything if there is no traffic. With App Engine Flex you will pay for the instance hours and the Flex (GCE) instances are bigger.
Use autoscaling with max instances, F1 instance class:
With autoscaling you do not need to guess how many instances you need. F1 are the smallest instances. Set the max instances in case you get DoS'd or more traffic than you can afford.
Stop Instances:
You can stop the App Engine versions when you do not expect the app to be used. The will be no charge for instance hours for either Standard or Flex. For Flex there will be disk charges. The app will be ready to go when you need it again.
App Engine Version Cleanup:
Versions are easy to create and harder to remove. Here is a post on project cleanup. See this post on App Engine cleanup
https://medium.com/google-cloud/app-engine-project-cleanup-9647296e796a

What are the rules of thumb when trying to decide if developing on Google App Engine platform is worthwhile

I have an idea for a web application and I am currently researching different platforms. I am really interested in Google App Engine, but it looks like it works pretty good for certain application types while it is less suitable for others (there are horror as well as success stories e.g. Goodbye Google App Engine vs. Why we are really happy with Google App Engine
There is also a similar negative story in this thread from 1 year ago, concluding GAE was not ready for commercial production platform: GAE as Production Platform. There are also other threads from 2009 talking about data select limits (1000 rows) that has since been lifted.
My app will essentially perform some mathematical analysis based on data pulled from external data feeds (could be some substantial amount of data), it would be real time only the first time data is downloaded for a specific item at hand and then stored and retrieved locally from the database at that point. There will be some additional external data pulls as scheduled intervals.
Based on this brief description, should I even bother starting on GAE? In general, what are the rules of thumb to try and decide if developing on GAE is suitable for a problem at hand? Also, what are the good examples of Apps in Production that use GAE. It looks like GAE App Gallery is not around anymore, but I would definitely appreciate any Web 2.0 App examples running on the app engine.
In your specific case I would double check these factors:
a. Is the mathematical analysis a long running CPU intensive job?
GAE is not designed for long running CPU intensive computational Jobs; this would lead to have an high billing cost and would force you to design your application to avoid some GAE limitations (10 minutes max per job, limited soft memory, CPU quota, etc. etc.).
b. Are you planning to retrieve external data using a mainstream API (twitter, yahoo, facebook)?
Your application shares the same pool of IPs with other applications; if the API you want to adopt does not allow authenticated request, your application will suffer hiccups caused by throttling/quota limits errors. I faced this problem here.
App Engine should work fine for your application. It's generally designed to serve, and to scale, sites that serve mostly user-facing traffic. Applications that it's not suitable for are things such as video transcoding, which rely heavily on backend processing, or things that have to shell out to native code, such as 3D graphics, etcetera.
Depends on what type of mathematical analysis are you doing. If your application is heavy in I/O, I would give it some pause. On GAE, you're kind of limited in your I/O options. You basically have the following:
RAM: I can't recall exactly, but GAE imposes a hard limit of around 200MB of RAM.
Datastore: You get plenty of space here, but it's slow compared to a cached local file system.
Memcache: Faster than datastore, but not nearly as fast as a cached disk. And worse, it's a cache, so there's no guarantee that it won't get wiped out.
External sources: These include calling out to external web-pages. Lots of flexibility, but very slow.
In sum, I would perhaps look at other options if you're doing heavy I/O on a medium-size dataset (>20MB and ~<2GB). These are probably non-issues for 90% of web-apps, although you should be aware of them.
All the negatives aside, working on GAE is a joyous experience. You spend more time programming and less time configuring. And it's really cheap.

Resources