I have deployed an sklearn model in AWS SageMaker using the sklearn.deploy method. For auto-scaling the endpoint, I've set the following configuration:
Target value for number of requests: 25
Scale-out cooldown: 30 sec
Scale-in cooldown: 20 sec
After sending 25+ requests, a new instance is deployed. But after this, even when I don't send new requests to the endpoint, it does not scale down automatically.
Why is it not scaling down?
How can I make it auto-scale down when no new requests are received for a fixed time interval?
As of this writing, SageMaker will not scale down to 0.
You must also specify the minimum number of instances for the model. This value must be at least 1, and equal to or less than the value specified for the maximum number of endpoint instances.
Source: https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling-prerequisites.html
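For reference, a minimal sketch of how this kind of configuration maps onto the Application Auto Scaling API via boto3. The endpoint and variant names are placeholders, and the capacities are illustrative; the point is that MinCapacity is the floor the quoted prerequisite refers to, and it cannot be 0:

import boto3

# Sketch: register the endpoint variant as a scalable target and attach
# a target-tracking policy matching the settings in the question.
# "my-sklearn-endpoint" and "AllTraffic" are placeholder names.
client = boto3.client("application-autoscaling")
resource_id = "endpoint/my-sklearn-endpoint/variant/AllTraffic"

client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,  # must be >= 1: scale-in bottoms out here, never at 0
    MaxCapacity=4,
)

client.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 25.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 30,
        "ScaleInCooldown": 20,
    },
)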
The Coinbase API doc says:
"By default, each API key or app is rate limited at 10,000 requests per hour. If your requests are being rate limited, HTTP response code 429 will be returned with a rate_limit_exceeded error."
[Question] I'd like to know whether the current API restrictions apply to a single app or a single user.
Thanks in advance
As you already mentioned, the limit is linked to the API key.
So if you have multiple apps using the same API key (ideally you should not), the limit will apply to the cumulative calls from all the apps. If you have separate apps using different keys, the limit will apply to each app.
In addition to Anupam's answer: if you use the Exchange API, the rates are different and are limited by IP.
REST API
When a rate limit is exceeded, a status of 429 Too Many Requests will be returned.
Public endpoints
We throttle public endpoints by IP: 10 requests per second, up to 15 requests per second in bursts. Some endpoints may have custom rate limits.
Private endpoints
We throttle private endpoints by profile ID: 15 requests per second, up to 30 requests per second in bursts. Some endpoints may have custom rate limits.
The /fills endpoint has a custom rate limit of 10 requests per second, up to 20 requests per second in bursts.
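To make a client robust against these limits, here is a small retry-with-backoff sketch. The helper and the example URL are illustrative, not part of the Coinbase docs:

import time
import requests

def get_with_backoff(url, max_retries=5):
    # Retry on HTTP 429 with exponential backoff, honoring a
    # Retry-After header when the server provides one.
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.get(url)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        retry_after = resp.headers.get("Retry-After")
        time.sleep(float(retry_after) if retry_after else delay)
        delay *= 2
    raise RuntimeError("still rate limited after %d retries" % max_retries)

# Example against a public endpoint (throttled by IP at 10 req/s):
# get_with_backoff("https://api.exchange.coinbase.com/products")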
I want to export metrics from Cloud Monitoring to BigQuery, and Google has provided a solution for how to do this. I am following this article.
I have downloaded the code from GitHub, and I am able to successfully deploy and run the application (Python 2.7).
I have set the aggregation alignment period to 86400s (I want to aggregate metrics per day, starting from 1st July).
One of the App Engine services, the write-metrics app, which writes the metrics to BigQuery after receiving the API response as a Pub/Sub message, keeps throwing these errors:
> Exceeded soft memory limit of 256 MB with 270 MB after servicing 5 requests total. Consider setting a larger instance class in app.yaml.
> While handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application or may be using an instance with insufficient memory. Consider setting a larger instance class in app.yaml.
The above is a 500 error and occurs very frequently, and I find that duplicate records are still being inserted into the BigQuery table. I also see this one:
DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded.
The App Engine logs frequently show POST requests with status codes 500 and 200.
In App Engine (standard) I have set scaling to automatic in app.yaml as below:
automatic_scaling:
  target_cpu_utilization: 0.65
  min_instances: 5
  max_instances: 25
  min_pending_latency: 30ms
  max_pending_latency: automatic
  max_concurrent_requests: 50
but this seems to have no effect. I am very new to App Engine, Google Cloud, and its Stackdriver metrics.
This change made it work:
instance_class: F4_1G
instance_class needs to be a top-level entry. Previously, I had made the mistake of putting it under automatic_scaling:, which produced an "illegal modifier" error.
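For clarity, a sketch of the corrected app.yaml layout. The runtime lines assume the question's Python 2.7 standard setup; the scaling values are the ones from the question:

runtime: python27
api_version: 1
threadsafe: true

instance_class: F4_1G  # top-level: a sibling of automatic_scaling, not a child

automatic_scaling:
  target_cpu_utilization: 0.65
  min_instances: 5
  max_instances: 25

F4_1G has more memory than the default class, which is what avoids the 256 MB soft limit quoted in the error.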
I'm building a data processing application, where I want an incoming (REST) request to cause a cloud instance to be started, do some processing, then retrieve the results. Typically, it would be something like this:
1. receive request
2. start instance
3. send request to instance
4. instance processes (~100% load on all instance CPUs)
5. poll service running on instance for status
6. fetch results from instance
7. shut down instance
I was planning on doing the instance management manually using something like jclouds, but am wondering if GAE could be configured to do something like this (saving me work).
If I have my processing service set up in GAE, can I make it so that a new instance is launched for every incoming request (or whenever the current instance(s) are at 100% CPU usage)?
Referring to instance management only (i.e. steps 1-4 and 7)...
From Scaling dynamic instances:
The App Engine scheduler decides whether to serve each new request
with an existing instance (either one that is idle or accepts
concurrent requests), put the request in a pending request queue, or
start a new instance for that request. The decision takes into account
the number of available instances, how quickly your application has
been serving requests (its latency), and how long it takes to spin up
a new instance.
Each instance has its own queue for incoming requests. App Engine
monitors the number of requests waiting in each instance's queue. If
App Engine detects that queues for an application are getting too long
due to increased load, it automatically creates a new instance of the
application to handle that load.
App Engine also scales instances in reverse when request volumes
decrease. This scaling helps ensure that all of your application's
current instances are being used to optimal efficiency and cost
effectiveness.
So in the scaling configuration I'd keep automatic_scaling (which is the default) and play with:
max_pending_latency:
The maximum amount of time that App Engine should allow a request
to wait in the pending queue before starting a new instance to handle
it. The default value is "30ms".
A low maximum means App Engine will start new instances sooner for pending requests, improving performance but raising running costs.
A high maximum means users might wait longer for their requests to be served (if there are pending requests and no idle instances to
serve them), but your application will cost less to run.
min_pending_latency:
The minimum amount of time that App Engine should allow a request to
wait in the pending queue before starting a new instance to handle it.
A low minimum means requests must spend less time in the pending queue when all existing instances are active. This improves
performance but increases the cost of running your application.
A high minimum means requests will remain pending longer if all existing instances are active. This lowers running costs but increases
the time users must wait for their requests to be served.
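Put together, a hedged app.yaml sketch of the two settings. The values are illustrative only, to be tuned against your instance start-up time and cost tolerance:

automatic_scaling:
  min_pending_latency: 500ms   # never start a new instance before a request has waited 500ms
  max_pending_latency: 2000ms  # always start one once a request has waited 2s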
See also in Change auto scaling performance settings:
Min Pending Latency - Raising Min Pending Latency instructs App Engine’s scheduler to not start a new instance unless a request
has been pending for more than the specified time. If all instances
are busy, user-facing requests may have to wait in the pending queue
until this threshold is reached. Setting a high value for this setting
will require fewer instances to be started, but may result in high
user-visible latency during increased load.
You may also want to take a look at Warmup requests, in case you want to reduce the latency for requests which would cause a new instance to be started.
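If you enable them, warmup requests are switched on in app.yaml and handled at /_ah/warmup. A minimal sketch for a Python 2.7 standard app using webapp2; the initialization body is a placeholder:

In app.yaml:

inbound_services:
- warmup

And the handler:

import webapp2

class WarmupHandler(webapp2.RequestHandler):
    def get(self):
        # Placeholder: do expensive one-time initialization here (load
        # models, prime caches, open connections) so user-facing
        # requests don't pay the cold-start cost.
        self.response.set_status(200)

app = webapp2.WSGIApplication([("/_ah/warmup", WarmupHandler)])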
As far as I can read the docs, both settings do the same thing: start a new instance when a request has spent longer in the pending queue than the setting specifies.
<max-pending-latency>
The maximum amount of time that App Engine should allow a request to wait in the pending queue before starting a new instance to handle it. Default: "30ms".
A low maximum means App Engine will start new instances sooner for pending requests, improving performance but raising running costs.
A high maximum means users might wait longer for their requests to be served, if there are pending requests and no idle instances to serve them, but your application will cost less to run.
<min-pending-latency>
The minimum amount of time that App Engine should allow a request to wait in the pending queue before starting a new instance to handle it.
A low minimum means requests must spend less time in the pending queue when all existing instances are active. This improves performance but increases the cost of running your application.
A high minimum means requests will remain pending longer if all existing instances are active. This lowers running costs but increases the time users must wait for their requests to be served.
Source: https://cloud.google.com/appengine/docs/java/config/appref
What's the difference between min and max then?
The piece of information you might be missing to understand these settings is that App Engine can choose to create an instance at any time between min-pending-latency and max-pending-latency.
This means an instance will never be created to serve a pending request before min-pending-latency and will always be created once max-pending-latency has been reached.
I believe the best way to understand is to look at the timeline of events when a request enters the pending queue:
A request reaches the application, but no instances are available to serve it, so it is placed in the pending requests queue.
Until the min-pending-latency is reached: App Engine tries to find an available instance to serve the request and will not create a new instance. If a request is served below this threshold, it is a signal for App Engine to scale down.
After the min-pending-latency is reached and until max-pending-latency is reached: App Engine tries to find an available instance to serve the request.
After the max-pending-latency is reached: App Engine stops searching for an available instance to serve the request and creates a new instance.
Source: app.yaml automatic_scaling element
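Since the question quotes the Java configuration reference, here is a hedged appengine-web.xml sketch of the same two settings, with illustrative values:

<appengine-web-app xmlns="http://appspot.com/ns/1.0">
  <automatic-scaling>
    <min-pending-latency>500ms</min-pending-latency>   <!-- no new instance before 500ms pending -->
    <max-pending-latency>2000ms</max-pending-latency>  <!-- always start one at 2s pending -->
  </automatic-scaling>
</appengine-web-app>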
How does App Engine's new "Minimum Pending Latency" setting affect warmup requests?
If I set a "Minimum pending latency" of 10 seconds, will my app still start up instances with warmup requests even if the pending latency never reaches that high?
My hope is that it will, because my app's cold start time is about 15 seconds, and I was hoping that by setting a high "minimum pending latency", it wouldn't try to start an instance on a user-facing request (making the user wait 15 seconds) but would still start up instances in the background with warmup requests.
The App Engine runtime will not start new instances as long as your pending latency is below the value you specify. This includes warmup requests. Once latency exceeds the value you specify, the runtime will start up new instances, using warmup requests when possible.