What AWS quotas do I need for SageMaker Asynchronous Inference Endpoints to be able to scale? - amazon-sagemaker

I want to deploy a model using an Asynchronous Inference endpoint which will auto-scale. However, I cannot find information about which quotas are required for this to work without running out of resources.
Does scaling require some specific type of quota, so that multiple jobs can be executed in parallel on different instances of the inference container?
It really isn't clear in the documentation whether quotas apply to Asynchronous Inference endpoints or not. Clearly they apply to real-time inference endpoints, but the Asynchronous Inference documentation does not seem to mention them at all.

Autoscaling with Async endpoints is no different from autoscaling with the other inference options, i.e. your AWS quotas need to cover the number of instances you wish to scale to. For instance, if you configure the minimum and maximum instance counts in your Async autoscaling config as shown below, you would need 5 instances of that instance type available to your account. [ Reference ]
response = client.register_scalable_target(  # client = boto3.client('application-autoscaling')
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/<your-endpoint-name>/variant/<your-variant-name>',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',  # The number of EC2 instances for your Amazon SageMaker model endpoint variant.
    MinCapacity=1,
    MaxCapacity=5,  # scaling to 5 instances requires quota for 5 instances of this type
)
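To give a fuller picture, here is a hedged sketch of the whole setup: registering the scalable target and then attaching a target-tracking policy. The endpoint and variant names are placeholders; ApproximateBacklogSizePerInstance is the CloudWatch metric SageMaker publishes for Asynchronous Inference endpoints, and async endpoints can also scale down to zero. Treat the target value and cooldowns as starting points, not recommendations.

```python
# Hypothetical names for illustration -- substitute your own endpoint/variant.
resource_id = "endpoint/my-async-endpoint/variant/AllTraffic"

# Target-tracking policy that scales on the queued-request backlog per instance,
# the metric usually used with Asynchronous Inference endpoints.
policy_config = {
    "TargetValue": 5.0,  # aim for ~5 queued requests per instance
    "CustomizedMetricSpecification": {
        "MetricName": "ApproximateBacklogSizePerInstance",
        "Namespace": "AWS/SageMaker",
        "Dimensions": [{"Name": "EndpointName", "Value": "my-async-endpoint"}],
        "Statistic": "Average",
    },
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60,
}

def attach_policy():
    """Attach the policy; requires AWS credentials and an existing endpoint."""
    import boto3
    client = boto3.client("application-autoscaling")
    client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=0,  # async endpoints can scale to zero
        MaxCapacity=5,  # must fit within your instance-type quota
    )
    client.put_scaling_policy(
        PolicyName="async-backlog-scaling",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration=policy_config,
    )
```

Note that MaxCapacity here is the number your instance-type quota has to cover; the policy itself does not raise any quota.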
NOTE - I work for AWS SageMaker, but my opinions are my own.


Best configuration for Automatic Scaling in Google App Engine to always have an instance available?

What is the best way to set up Google App Engine to always have at least one instance ready and available to handle requests when using automatic scaling? This is for a low traffic application.
There are settings here that allow you to control this, but I am not sure what the best combination is, and some of them sound confusing. For example, min-instances and min-idle-instances sound similar. I tried setting min-instances to 1 but I still experienced lag.
What is a configuration that, from the end user's point of view, is always on and handles requests without lag (for a low-traffic application)?
In the App Engine Standard environment, loading your application's code while handling a request can cause users to experience extra latency; warmup requests can help you reduce it. Warmup requests load the app's code onto a new instance before any live requests reach that instance. If this is enabled, App Engine detects when your application needs a new instance and initiates a warmup request to initialize one. You can check this link for Configuring Warmup Requests to Improve Performance.
Regarding min-instances and min-idle-instances, these only apply when warmup requests are enabled. As you can see in this post, the difference between the two elements is: min-instances is used to process incoming requests immediately, while min-idle-instances is used to absorb high traffic load.
However, you mentioned that you don't need a warmup, so we suggest you select App Engine Flexible; based on this documentation, it must have at least one instance running at all times and can scale up in response to traffic. Please take note that using this environment costs more. You can refer to this link for the pricing of the two App Engine environments.
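If you do stay on the Standard environment with warmup enabled, a sketch of the relevant app.yaml elements might look like the following (runtime and the instance counts are examples; tune them for your traffic):

```yaml
runtime: python39

inbound_services:
  - warmup        # enables /_ah/warmup so new instances load code before serving

automatic_scaling:
  min_instances: 1        # keep at least one instance resident
  min_idle_instances: 1   # keep one idle instance ready for traffic spikes
```

Keeping a resident plus an idle instance costs instance hours even when no traffic arrives, which is the trade-off for removing cold-start lag.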

Why is configure autoscaling greyed out on sagemaker

I am trying to configure autoscaling for my endpoint on SageMaker. However, I cannot do it in the UI, as the button is greyed out after I click on a model variant. I do, however, have another model set up from before where the button is not greyed out.
Why is this not available for some models on sagemaker but it is for others?
Any type of model hosted on a SageMaker endpoint supports autoscaling; it's not restricted by model type either. However, I can think of the following reasons why it might be disabled for one of the endpoints.
Are you seeing these two models in the same AWS account and using the same IAM role to update the autoscaling policies? If not, that is the issue. Update your IAM policies and the button should work.
SageMaker doesn't support autoscaling for burstable instances such as T2, because they already allow for increased capacity under increased workloads. Is the endpoint in question running on one of these burstable instance types? If yes, you know the reason now. Just change the instance to any other type and it should work fine.
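A quick way to check whether the burstable-instance case applies is to look up the instance types behind the endpoint. This is a sketch with helper names of my own; it assumes "burstable" means the T-series (ml.t*) families:

```python
def is_burstable(instance_type: str) -> bool:
    """T-series (burstable) SageMaker instances don't support autoscaling."""
    family = instance_type.split(".")[1] if instance_type.startswith("ml.") else instance_type
    return family.startswith("t")

def variant_instance_types(endpoint_name: str):
    """Look up instance types for an endpoint's variants (needs AWS credentials)."""
    import boto3
    sm = boto3.client("sagemaker")
    endpoint = sm.describe_endpoint(EndpointName=endpoint_name)
    config = sm.describe_endpoint_config(
        EndpointConfigName=endpoint["EndpointConfigName"]
    )
    return {v["VariantName"]: v["InstanceType"] for v in config["ProductionVariants"]}

print(is_burstable("ml.t2.medium"))  # True  -> autoscaling greyed out
print(is_burstable("ml.m5.xlarge"))  # False -> autoscaling available
```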

Google App Engine streaming data into Bigquery: GCP architecture

I'm working with a Django web app deployed on Google App Engine flexible environment.
I'm streaming my data while processing requests in my views using bigquery.Client(), but I think it is not the best way to do it. Do I need to delegate this processing outside of the view (using Pub/Sub, Cloud Tasks, Cloud Functions, etc.)? If so, please suggest a suitable architecture: which GCP product should I use, how to connect things, and what to read.
Based on your comment, I would recommend Cloud Run.
Cloud Run is a serverless, container-based product. You write a webserver (that handles your POST request), wrap it in a container, and deploy it on Cloud Run.
With a brand-new feature, named CPU always allocated, the CPU is not throttled after the response is sent (the normal behavior). With CPU always allocated, you keep the full CPU until the Cloud Run instance is offloaded (usually after 15 minutes, but it can be quicker).
The benefit of the feature is the capacity to return the response to the client immediately, and then continue to process your data asynchronously and store it in BigQuery (in streaming mode).
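A minimal sketch of that pattern, stdlib only; here stream_to_bigquery is a stand-in for the real streaming insert (something like bigquery.Client().insert_rows_json(...)), and the handler/field names are made up:

```python
import concurrent.futures

# One worker pool per instance; it outlives individual requests.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def stream_to_bigquery(rows):
    """Stand-in for the real streaming insert, e.g.
    bigquery.Client().insert_rows_json("dataset.table", rows)."""
    return len(rows)

def handle_request(payload):
    """Validate quickly, answer the client, and stream in the background."""
    rows = [{"event": payload.get("event"), "value": payload.get("value")}]
    executor.submit(stream_to_bigquery, rows)  # continues after the response is sent
    return {"status": "accepted"}  # returned to the client immediately

print(handle_request({"event": "signup", "value": 1}))
```

Note that background work after the response is only safe on Cloud Run when CPU always allocated is enabled; with the default throttling, the submitted task may be starved of CPU.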

What are the costs involved in triggering functions in Google App Engine?

I am interested in triggering notifications into my Salesforce system when reviews are posted on my Google Business profile.
This article describes being able to call the Salesforce API by executing Python code using Google's App Engine, leaving very little for me to code myself, but I'm not familiar with how Google works in terms of billing.
Essentially, when a topic/subscription is triggered, Google will execute a small Python script to call the Salesforce API. I want to know how much it will cost to do this, or how I can calculate how this will be billed.
You must be referring to Cloud Functions because that's the product used in the article (also, Cloud Functions has Pub/Sub event trigger). App Engine and Cloud Functions are both serverless products but they have different use cases. If you wish to know the difference between them, here's a good SO answer.
You are billed depending on how long the function takes to finish, how many times you invoke it, and the type of resources configured on the function (the higher the CPU and memory, the higher the cost). There are additional charges as well if your function makes outbound data transfers whenever you invoke it.
You can estimate your monthly cost by checking out the GCP pricing calculator. Also, there are several products involved in the article (such as Secret Manager and Pub/Sub), so take note to add these products to your estimate. For additional pricing details, check out the docs: https://cloud.google.com/functions/pricing
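As a rough illustration of how those dimensions combine, here is a back-of-envelope sketch. The unit prices and free-tier figures below are placeholders, not current rates, and it ignores CPU pricing, networking, and tiered discounts; plug in the numbers from the pricing page.

```python
def monthly_function_cost(invocations, avg_seconds, memory_gb,
                          price_per_million=0.40,        # placeholder $/1M invocations
                          price_per_gb_second=0.0000025,  # placeholder $/GB-second
                          free_invocations=2_000_000,
                          free_gb_seconds=400_000):
    """Back-of-envelope Cloud Functions estimate from invocations and compute time."""
    invoke_cost = max(invocations - free_invocations, 0) / 1_000_000 * price_per_million
    gb_seconds = invocations * avg_seconds * memory_gb
    compute_cost = max(gb_seconds - free_gb_seconds, 0) * price_per_gb_second
    return round(invoke_cost + compute_cost, 2)

# e.g. 3M invocations/month, 0.5 s each, on a 256 MB function
print(monthly_function_cost(3_000_000, 0.5, 0.25))  # 0.4
```

Here only the 1M invocations above the free tier are billed, and the 375,000 GB-seconds of compute stay inside the free compute tier.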
I also have one more piece of advice, though this is not about the pricing of invoking the function but about the additional charges when deploying it. From the GCP docs:
When you deploy your function's source code to Cloud Functions, that source is stored in a Cloud Storage bucket. Cloud Build then automatically builds your code into a container image and pushes that image to Container Registry. Cloud Functions accesses this image when it needs to run the container to execute your function.
This applies to newer runtimes. It may be noticeable on your bill if you frequently deploy/update a function, because Cloud Build has to re-build those images and those images have to be stored in a Cloud Storage bucket. Functions take time to deploy, so it's best practice to test your functions locally first. A solution is to use the Functions Framework.

Can Google App Engine Memcache Standard be accessed from an external server

I am trying to figure out how to access Google App Engine Memcache service from outside Google App Engine. Any help on how this can be done would be greatly appreciated.
Thanks in advance!
I don't think this is currently possible. I don't know if there is any technical argument for this or if this decision has been made simply for billing purposes. But it seems like memcache is intended to be an integral part of App Engine. The only relevant discussion I could find is this feature request. It calls for the possibility of accessing the memcache data of one App Engine project from another App Engine project. It seems to me that Google didn't consider such functionality to be beneficial. You could try filing your own feature request to make memcache a standalone service. In case you do not succeed (and I am afraid you won't), here is a simple workaround.
A simple workaround:
Create a simple App Engine project which would serve as a facade over the memcache service. This dummy App Engine project would simply translate your HTTP requests into memcache API calls and return the obtained data in the body of an HTTP response. For example, to retrieve a memcache record you could send a GET request such as:
http://your-facade-app.appspot.com/get?key=some-key
This call would get "translated" into:
memcache.get("some-key")
And the obtained data appended to the HTTP response.
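A sketch of what the facade's request handling could look like. The handler names and URL scheme are my own invention; on App Engine Standard the real client would be google.appengine.api.memcache, which the dict-backed fake below merely stands in for:

```python
def handle_get(params, cache):
    """Translate GET /get?key=... into a cache lookup; returns (status, body)."""
    key = params.get("key")
    if key is None:
        return (400, "missing 'key' parameter")
    value = cache.get(key)
    if value is None:
        return (404, "")
    return (200, value)

def handle_set(params, cache):
    """Translate /set?key=...&value=... into a cache write."""
    cache.set(params["key"], params["value"])
    return (204, "")

# Demo with a dict standing in for the memcache client:
class FakeCache(dict):
    def set(self, k, v):
        self[k] = v

cache = FakeCache()
handle_set({"key": "greeting", "value": "hello"}, cache)
print(handle_get({"key": "greeting"}, cache))  # (200, 'hello')
```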
Since accessing memcache is free, you would only have to pay for instance time. I don't know what throughput you are expecting, but I can imagine scenarios where you could even fit into the free daily quota (currently 28 hours/day). All in all, the intermediate App Engine project should not come with a significant cost in either performance or price.
Before using this workaround:
The above snippet of code is intended for illustration purposes only. There still remain some issues to be dealt with before using this approach in production. For example, as pointed out by Suken, anyone would be able to access your memcache if they knew what requests to send. Here are four additional things I would personally do:
Address the security issues by sending some authentication token with each request. An obvious necessity would be to make the calls over HTTPS to prevent man-in-the-middle attackers from obtaining this token. Note that App Engine's appspot.com subdomains are accessible via HTTPS by default.
Prefer batch API calls such as getAll() over their single record alternatives such as get(). Retrieving multiple records in one batch call is much faster than making multiple separate API calls.
Use POST requests (instead of GET) to access the facade application. You won't have to worry about your batch requests being too large. I only used a GET request in the example above because it was easier to write.
Check if such usage of App Engine doesn't violate the Terms of Service. Personally, I don't believe it does. And I don't see why Google should mind. After all, you will be paying for instance hours.
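For the authentication token in the first point, here is a hedged sketch using a shared-secret HMAC over the request body; the scheme is my own, and any signed-token approach would do:

```python
import hashlib
import hmac

SECRET = b"shared-secret-configured-on-both-sides"  # example value only

def sign(body: bytes) -> str:
    """Compute the token the caller attaches to each request."""
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

def verify(body: bytes, token: str) -> bool:
    """Constant-time check performed by the facade before touching memcache."""
    return hmac.compare_digest(sign(body), token)

token = sign(b'{"keys": ["a", "b"]}')
print(verify(b'{"keys": ["a", "b"]}', token))  # True
print(verify(b'{"keys": ["evil"]}', token))    # False
```

Sent over HTTPS, this keeps both the token and a tampered body from being replayed by a man in the middle.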
EDIT: After giving this some more thought, I believe that the suggested workaround is actually what Google presumes you to do. Given that Google's objective is to earn money, it would be unreasonable to provide a free service unless it was part of a paid one. Of course, other billing schemes could be created. For example, allowing direct access only for developers who are willing to pay for dedicated memcache. The question is whether your use case is broad enough to convince Google to take some action.
No, AFAIK the Memcache service is not available outside GAE. To be even more specific it is only available inside the GAE standard environment, it is unavailable in the GAE flexible environment.
But some of the alternate solutions suggested for GAE flexible users might be useable for you as well. From Memcache:
The Memcache service is currently not available for the App Engine flexible environment. An alpha version of the memcache service will be available shortly. If you would like to be notified when the service is available, fill out this early access form.
If you need access to a memcache service immediately, you can use the third party memcache service from Redis Labs. To access this service, see Caching Application Data Using Redis Labs Memcache.
You can also use Redis Labs Redis Cloud, a third party fully-managed service. To access this service, see Caching Application Data Using Redis Labs Redis.
As stated by other users, Memcache is not offered as a service outside GAE (Google App Engine). I would like to point out that implementing a GAE facade over the memcache service has security ramifications. Please note that the facade GAE memcache app will be exposed on the public internet like any other GAE service. I am assuming that you want to use memcache for internal use only. Another aspect to think about is writing into memcache. If you intend to write to memcache from outside GAE, then definitely avoid the facade implementation. If compromised, anyone would be able to use your facade implementation as their own cache without paying for it ;)
My suggestion is to spin up a stack using GCP Cloud Launcher. There are various stack templates available for both Redis and Memcache stacks. Further, you can configure the template to use preemptible burstable instances to reduce the cost of your Memcache.