TorchServe on SageMaker scaling and latency issues - amazon-sagemaker

We’re using TorchServe on SageMaker with the deep learning containers (https://github.com/aws/deep-learning-containers/blob/master/available_images.md).
Everything seems to be running well: we have minimal CPU and memory usage, and request latency is about 3 seconds (which is quite long but okay). Once we start to ramp up usage, model latency increases dramatically, taking up to 1 minute to respond, yet CPU and memory usage remain minimal. We suspect that TorchServe is not scaling properly on the instance and is only deploying one worker when it should run at least 4 on a 4-vCPU system (ml.m5.xlarge instance). We have been following the official AWS guides.
How can we resolve this?
We work with large files (1-20 MB images). However, SageMaker has a payload size limit of 5 MB. To get around this we use S3 to store our intermediary files (which potentially causes long latency). Our images go through preprocessing, enhancing, segmentation, and postprocessing, which are currently all individual SageMaker endpoints controlled by a Lambda function. We want to use SageMaker pipelines, but given the payload limit we see no incentive to change our system.
Is there a better way of doing this?
We have been experimenting with multi-model endpoints. Ideally, we would like a single ml.g4dn.xlarge to run all our PyTorch models so that we benefit from low latency and high throughput. Still, when we deploy a TorchServe multi-model endpoint with the deep learning containers, invoking it fails with the error "model version not selected". I have looked everywhere for anyone having similar issues but can't find anything.
Are there any guides for TorchServe multi-model endpoints? Are we making an obvious mistake?
We have experimented with Elastic Inference and AWS Inf instances. However, we have seen no latency decrease compared with the standard CPU instance. Again, are we making a mistake here? The benefits of these instances look valuable to us, and we would love to use them.

Related

Is SageMaker multi-node Spot-enabled GPU training an anti-pattern?

Is it an anti-pattern to do multi-node Spot-enabled distributed GPU training on SageMaker?
I'm afraid that several issues will slow things down or even make them infeasible:
the interruption detection lag
the increased probability of interruption (N instances)
the need to re-download data at every interruption
the need to start/stop whole clusters instead of just replacing interrupted nodes
the fact that SageMaker doesn't support variable-size clusters
Additionally, the EC2 Spot documentation deters users from using Spot in multi-node workflows where nodes are tightly coupled (which is the case in data-parallel and model-parallel training): "Spot Instances are not suitable for workloads that are inflexible, stateful, fault-intolerant, or tightly coupled between instance nodes."
Anybody here have experience doing Spot-enabled distributed GPU training on SageMaker happily?
The short answer is that Spot training works well when the instance type you need, in the region you need, has enough free capacity at that particular time. Otherwise you won't be able to start the job, or you will get interruptions too frequently.
Why not just try it for yourself? Once you have a working on-demand training job, you can enable spot training by adding 3 relevant parameters to the job's Estimator definition, and implement checkpoint save/load (good to have anyway). Then if it works well, great! If not, switch back.
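As a minimal sketch of what that change might look like: the answer does not name the parameters, so this assumes they are the SageMaker Python SDK Estimator arguments `use_spot_instances`, `max_run`, and `max_wait`, with `checkpoint_s3_uri` added for checkpointing. The helper name `spot_estimator_kwargs` and the bucket path are hypothetical.

```python
def spot_estimator_kwargs(max_run, max_wait, checkpoint_s3_uri):
    """Build Spot-related keyword arguments for a SageMaker Estimator.

    Assumption: these are the parameters the answer refers to.
    max_run  - maximum training time in seconds
    max_wait - maximum total time in seconds, including waiting for Spot capacity
    """
    # The total wait budget has to cover at least the training time itself.
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run,
        "max_wait": max_wait,
        # Checkpoints written here let an interrupted job resume.
        "checkpoint_s3_uri": checkpoint_s3_uri,
    }


# Hypothetical values: 1 hour of training, up to 2 hours including waiting.
kwargs = spot_estimator_kwargs(3600, 7200, "s3://my-bucket/checkpoints/")
```

These keyword arguments would then be merged into the existing Estimator definition, for example `Estimator(..., **kwargs)`. Switching back to on-demand is a matter of dropping them again.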

How to handle single large request spike to SageMaker endpoint

I know SageMaker endpoints have autoscaling as an option, but from my understanding that mainly applies when there is a sustained high request volume. Our issue is that on occasion there will be a single huge sudden spike, after which traffic goes back to normal. Is autoscaling fast enough (what's the delay?) to handle that? Or does it need to actually spin up another instance? Would having two instances at the endpoint help it respond immediately to an isolated request spike? I'm just not clear on what the response delay is for autoscaling, and I do not see this mentioned in their posts/documentation. Thanks
As for how long SageMaker needs to start a new instance: the time varies depending on the model size, how long it takes to download the model, and the start-up time of the container.
Here are the options you have:
Scale on hardware utilization, for example scale out when CPU utilization reaches 76%.
Use step scaling on a metric such as OverheadLatency to scale out.
Here is a good resource for both options: https://aws.amazon.com/blogs/machine-learning/configuring-autoscaling-inference-endpoints-in-amazon-sagemaker/
You can always run load tests before choosing the strategy that best fits your needs. See this post on load testing:
https://aws.amazon.com/blogs/machine-learning/load-test-and-optimize-an-amazon-sagemaker-endpoint-using-automatic-scaling/
Also, I think SageMaker Serverless Inference is a good solution for your case, but it is in preview right now:
https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-sagemaker-serverless-inference/
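The CPU-utilization option from the answer above could be sketched roughly as follows. This only builds the request arguments for an Application Auto Scaling target-tracking policy and does not call AWS. The endpoint name, variant name, policy name, and cooldown values are assumptions for illustration, and the 76% target is taken from the answer.

```python
def cpu_target_tracking_policy(endpoint_name, variant_name,
                               target_cpu_percent=76.0,
                               scale_out_cooldown=60,
                               scale_in_cooldown=300):
    """Build arguments for a target-tracking policy on endpoint CPU utilization.

    The cooldown defaults are illustrative assumptions, not recommendations.
    """
    return {
        "PolicyName": f"{endpoint_name}-cpu-target-tracking",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Add instances when average CPU utilization rises above this value.
            "TargetValue": float(target_cpu_percent),
            "CustomizedMetricSpecification": {
                "MetricName": "CPUUtilization",
                "Namespace": "/aws/sagemaker/Endpoints",
                "Dimensions": [
                    {"Name": "EndpointName", "Value": endpoint_name},
                    {"Name": "VariantName", "Value": variant_name},
                ],
                "Statistic": "Average",
                "Unit": "Percent",
            },
            "ScaleOutCooldown": scale_out_cooldown,
            "ScaleInCooldown": scale_in_cooldown,
        },
    }


# Hypothetical endpoint and variant names.
policy = cpu_target_tracking_policy("my-endpoint", "AllTraffic")
```

With boto3, the dictionary would typically be passed as `boto3.client("application-autoscaling").put_scaling_policy(**policy)` after the variant has been registered with `register_scalable_target`. A policy like this still has to launch a new instance before it helps, so load testing remains the way to find out whether it reacts quickly enough for an isolated spike.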

Identify why Google app engine is slow

I developed an application for a client that uses Play framework 1.x and runs on GAE. The app works great, but sometimes it is extremely slow. It can take around 30 seconds to load a simple page, yet at other times it runs faster, with no code change whatsoever.
Is there any way to identify why it's running slow? I tried to contact support but I couldn't find any telephone number or email. There is also no response on the official Google group.
How would you approach this problem? Currently my customer is very angry because of the slow loading time, but switching to another provider is a last resort at the moment.
Use GAE Appstats to profile your remote procedure calls. All of the RPCs are slow (Google Cloud Storage, Google Cloud SQL, ...), so if you can reduce the number of RPCs or use caching data structures, do so and your application will be much faster. With Appstats you can see which parts are slow and whether they need attention :)
For example, I've created a Google Cloud Storage cache for my application and decreased execution time from 2 minutes to under 30 seconds. The RPCs are a bottleneck in the GAE.
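The effect of caching on the RPC count can be illustrated with a small language-agnostic sketch, written here in Python rather than the Java/Play stack from the question. `fetch_remote` is a hypothetical stand-in for a slow RPC such as a Cloud Storage read, and it is not the cache the answer describes.

```python
import functools

call_count = 0  # number of simulated RPCs actually issued


def fetch_remote(key):
    """Hypothetical stand-in for a slow RPC (e.g. a storage or SQL read)."""
    global call_count
    call_count += 1
    return f"value-for-{key}"


@functools.lru_cache(maxsize=256)
def fetch_cached(key):
    """Serve repeated keys from memory and only call the RPC on a miss."""
    return fetch_remote(key)


# Five lookups over two distinct keys trigger only two simulated RPCs.
results = [fetch_cached(k) for k in ["a", "b", "a", "a", "b"]]
```

An in-process cache like this lives only as long as the instance does, so on a platform that starts and stops instances a shared cache may be the better fit.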
Google does not usually provide a direct support contact for many of its services. The slowness you describe is probably caused by cold starts. Google App Engine front-end instances go to sleep after about 15 minutes of inactivity. You could write a cron job that pings the instances every 14 minutes to keep them up.
Combining some answers and adding a few things to check:
Debug using app stats. Look for "staircase" situations and RPC calls. Maybe something in your app is triggering RPC calls at certain points that don't happen in your logic all the time.
Tweak your instance settings. Add some permanent/resident instances and see if that makes a difference. If you are spinning up new instances, things will be slow, for probably around the time frame (30 seconds or more) you describe. It will seem random. It's not just how many instances, but what combinations of the sliders you are using (you can actually hurt yourself with too little/many).
Look at your app itself. Are you doing lots of memory allocations in the JVM? Allocating/freeing memory is inherently a slow operation and can cause freezes. Are you sure your freezing is not a JVM issue? Try replicating the problem locally and tweak the JVM xmx and xms settings and see if you find similar behavior. Also profile your application locally for memory/performance issues. You can cut down on allocations using pooling, DI containers, etc.
Are you running any sort of cron jobs/processing on your front-end servers? Try to move as much as you can to background tasks such as sending emails. The intervals may seem random, but it can be a result of things happening depending on your job settings. 9 am every day may not mean what you think depending on the cron/task options. A corollary - move things to back-end servers and pull queues.
It's tough to give you a good answer without more information. The best someone here can do is give you a starting point, which pretty much every answer here already has.
By making at least one instance permanent, you get a great improvement on first use. It takes about 15 seconds to load the application into an instance, which is why you experience long request times when nobody has been using the application for a while.

Project suited for Google App Engine?

I'm currently developing a small hobby project (open sourced at https://github.com/grav/mailbum) which quite simply takes images from a Gmail account and puts them in albums on Picasa Web.
Since it's (currently) only dealing with Google-hosted data, I was thinking about hosting it on Google App Engine, but I'm not sure if it's well-suited for GAE:
Will the maximum execution time be a problem? It's currently 10 minutes according to http://googleappengine.blogspot.com/2010/12/happy-holidays-from-app-engine-team-140.html, but I'd think the tasks (i.e. processing a single mail) would be easy to run in parallel. I'm also guessing that dealing with Google-hosted data would be quite efficient on GAE?
Will the fact that it's written in Clojure be an obstacle? I've researched getting Clojure to run on GAE a bit, but I've never tried it. Any pointers?
Thanks for any advice and thoughts on the project!
It seems like your application is doable on GAE. My points of concern would be:
Does your code ever store the images that it is processing to temporary files? If so it will need to be changed to do everything in memory, because GAE applications are sandboxed and not allowed to write to the filesystem (if you need temporary persistent storage, you might be able to work something out where you write your file data to a BLOB field in the GAE datastore).
How do you get the images into Picasa Web? If they provide a simple REST/HTTP API then all is well. If you need something more involved than that (like a raw TCP socket) then it won't work.
The 10-minute execution time limit only applies to background tasks. When actually servicing web requests the time limit is 30 seconds. So if you provide a web-based interface to your app, you need to structure things so that the interface is just scheduling jobs that run in the background (i.e. you can't fire off a job directly as part of servicing a web request).
If none of those sound like show-stoppers to you, then I think your app should work just fine on GAE.
Can't really say if Clojure will work though. I have, however, spent time in the past getting some third-party libraries to work on App-Engine. Generally all I had to do was remove/modify/disable any parts of the library that accessed features that are forbidden by the sandbox (for instance, I had to disable the automatic caching to disk to get commons-fileupload to work on GAE). Not sure if the same would apply to Clojure, or even what the scope would be on a task like that.
I have been dabbling with Clojure and App Engine for a while now and I have to recommend appengine-magic. It abstracts most of the Java stuff away and is very easy to use. As a plus the project seems to be very active.

What are the rules of thumb when trying to decide if developing on Google App Engine platform is worthwhile

I have an idea for a web application and I am currently researching different platforms. I am really interested in Google App Engine, but it looks like it works well for certain application types while being less suitable for others (there are horror stories as well as success stories, e.g. Goodbye Google App Engine vs. Why we are really happy with Google App Engine).
There is also a similar negative story in this thread from 1 year ago, concluding that GAE was not ready to be a commercial production platform: GAE as Production Platform. There are also other threads from 2009 talking about data select limits (1000 rows) that have since been lifted.
My app will essentially perform some mathematical analysis on data pulled from external data feeds (possibly a substantial amount of data). It would be real-time only the first time data is downloaded for a specific item; after that the data is stored in and retrieved from the local database. There will be some additional external data pulls at scheduled intervals.
Based on this brief description, should I even bother starting on GAE? In general, what are the rules of thumb for deciding whether developing on GAE is suitable for a given problem? Also, what are some good examples of apps in production that use GAE? It looks like the GAE App Gallery is not around anymore, but I would definitely appreciate any Web 2.0 app examples running on App Engine.
In your specific case I would double check these factors:
a. Is the mathematical analysis a long running CPU intensive job?
GAE is not designed for long-running, CPU-intensive computational jobs; these would lead to a high billing cost and would force you to design your application around some GAE limitations (10 minutes max per job, limited soft memory, CPU quota, etc.).
b. Are you planning to retrieve external data using a mainstream API (twitter, yahoo, facebook)?
Your application shares the same pool of IPs with other applications; if the API you want to adopt does not allow authenticated requests, your application will suffer hiccups caused by throttling/quota limit errors. I faced this problem here.
App Engine should work fine for your application. It's generally designed to serve, and to scale, sites that serve mostly user-facing traffic. Applications that it's not suitable for are things such as video transcoding, which rely heavily on backend processing, or things that have to shell out to native code, such as 3D graphics, etcetera.
It depends on what type of mathematical analysis you are doing. If your application is heavy on I/O, I would give it some pause. On GAE, you're somewhat limited in your I/O options. You basically have the following:
RAM: I can't recall exactly, but GAE imposes a hard limit of around 200MB of RAM.
Datastore: You get plenty of space here, but it's slow compared to a cached local file system.
Memcache: Faster than datastore, but not nearly as fast as a cached disk. And worse, it's a cache, so there's no guarantee that it won't get wiped out.
External sources: These include calling out to external web-pages. Lots of flexibility, but very slow.
In sum, I would perhaps look at other options if you're doing heavy I/O on a medium-size dataset (>20MB and ~<2GB). These are probably non-issues for 90% of web-apps, although you should be aware of them.
All the negatives aside, working on GAE is a joyous experience. You spend more time programming and less time configuring. And it's really cheap.

Resources