How to enable server-side batching on SageMaker PyTorch TorchServe endpoints? - amazon-sagemaker

How do I enable server-side batching on SageMaker PyTorch TorchServe endpoints?
I can't seem to find relevant documentation about this.

In TorchServe itself, two configuration parameters set up server-side batching:
batchSize: This is the maximum batch size that a model is expected to handle.
maxBatchDelay: This is the maximum batch delay time, in ms, that TorchServe waits to receive batch_size number of requests. If TorchServe doesn’t receive batch_size number of requests before this timer times out, it sends whatever requests were received to the model handler.
With the recent update of the PyTorch Inference Toolkit for SageMaker (October 2021), these variables have been exposed to users via environment variables:
batchSize is set by SAGEMAKER_TS_BATCH_SIZE
maxBatchDelay is set by SAGEMAKER_TS_MAX_BATCH_DELAY
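To make the interaction between the two parameters concrete, here is a toy simulation of the flush rule described above. It is only a sketch under simplifying assumptions (the timer starts when the first request of a batch arrives, and arrival times are known up front); it is not TorchServe's actual implementation, and the function name is made up for illustration.

```python
def simulate_batches(arrivals_ms, batch_size, max_batch_delay_ms):
    """Group request arrival times (ms) into batches.

    A batch is flushed when it reaches batch_size, or when a request
    arrives after the delay timer of the current batch has expired.
    """
    batches = []
    current = []
    deadline = None
    for t in sorted(arrivals_ms):
        # Timer expired before this request arrived: flush what we have.
        if current and t > deadline:
            batches.append(current)
            current = []
        # First request of a new batch starts the delay timer.
        if not current:
            deadline = t + max_batch_delay_ms
        current.append(t)
        # Batch is full: flush immediately.
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


print(simulate_batches([0, 10, 20, 30, 500], batch_size=3, max_batch_delay_ms=100))
```

With these inputs the first three requests fill a batch, while the fourth is sent alone once its timer runs out, which is the trade-off between latency and batch fullness that maxBatchDelay controls.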
All in all, an example of how to set up server-side batching using the SageMaker Python SDK:
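from sagemaker.pytorch.model import PyTorchModel
env_variables_dict = {"SAGEMAKER_TS_BATCH_SIZE": "3", "SAGEMAKER_TS_MAX_BATCH_DELAY": "100000"}
pytorch_model = PyTorchModel(
    # model_data, role, entry_point, framework_version, ... go here
    env=env_variables_dict,  # passes the batching settings to the container
)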
See also this AWS ML blog post:
Optimize your inference jobs using dynamic batch inference with TorchServe on Amazon SageMaker


Publishing message to GCP pubsub using API is time Consuming

I have a Node.js app on the MongoDB cloud platform which will be used for posting 1 million messages to a topic in GCP PubSub. Since the platform does not support the npm package @google-cloud/pubsub, we implemented it using the API reference for PubSub. Upon load testing the app, I can see that each message takes 50 seconds to post to the topic. Ideally it should take less than 5 seconds. It takes around 30 seconds for the access_token API call and 20 seconds for the message posting API call. Since each message post is an independent event, we cannot maintain a session to store the access_token and reuse it, and the API_KEY authentication method is not available for GCP PubSub. Is the API method for GCP PubSub very slow compared to using the @google-cloud/pubsub library?
Can anyone suggest a solution to improve the performance of GCP PubSub using the APIs?
The PubSub client libraries are heavily optimized in several ways. First, they use the gRPC protocol instead of the REST API. Second, they aggregate messages before pushing to PubSub (500ms of wait by default). Third, they use various async mechanisms to parallelize the processing.
So, the Client Library teams have done a huge amount of great work that is hard (or expensive) to reproduce on your side. But you can: the sources are public, so you can have a look at the client libraries!
The 30s for the access_token retrieval is too long. Are you sure that you don't have a network issue? In any case, this token is valid for 1 hour. If you can reuse it in your subsequent calls you will save a lot of time!
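As a rough illustration of reusing the token, the sketch below caches it until shortly before it expires. It assumes there is somewhere to keep state between posts (a process-level object here; a shared cache would be needed if every post really runs in isolation). The `TokenCache` class and the `fetch_token` callable, which is expected to return a `(token, expires_in_seconds)` pair, are hypothetical names and not part of any Google library.

```python
import time


class TokenCache:
    """Caches an access token and refreshes it shortly before expiry."""

    def __init__(self, fetch_token, clock=time.monotonic, safety_margin_s=60):
        self._fetch_token = fetch_token  # hypothetical: returns (token, expires_in_s)
        self._clock = clock
        self._margin = safety_margin_s
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        # Refresh only if there is no token yet or it is about to expire.
        if self._token is None or now >= self._expires_at - self._margin:
            token, expires_in_s = self._fetch_token()
            self._token = token
            self._expires_at = now + expires_in_s
        return self._token
```

With a one-hour token, a million posts would then pay for the slow token call only once per hour instead of once per message.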

Cloud Tasks client ignores retry configuration

Basically what the title says. The API and client docs state that a retry can be passed to create_task:
retry (Optional[google.api_core.retry.Retry]): A retry object used
to retry requests. If ``None`` is specified, requests will
be retried using a default configuration.
But this simply doesn't work. Passing a Retry instance does nothing and the queue-level settings are still used. For example:
from google.api_core.retry import Retry
from google.cloud.tasks_v2 import CloudTasksClient
client = CloudTasksClient()
retry = Retry(predicate=lambda _: False)
client.create_task('/foo', retry=retry)
This should create a task that is not retried. I've tried all sorts of different configurations and every time it just uses whatever settings are set on the queue.
You can pass a custom predicate to retry on different exceptions. There is no formal indication that this parameter prevents retrying. You may check the Retry page for details.
Google Cloud Support has confirmed that task-level retries are not currently supported. The documentation for this client library is incorrect. A feature request exists here
Task-level retry parameters are available in the Google App Engine bundled service for task queuing, Task Queues. If your app is on GAE, which I'm guessing it is since your question is tagged with google-app-engine, you could switch from Cloud Tasks to GAE Task Queues.
Of course, if your app relies on something that is exclusive to Cloud Tasks like the beta HTTP endpoints, the bundled service won't work (see the list of new features, and don't worry about the "List Queues command" since you can always see that in the configuration you would use in the bundled service). Barring that, here are some things to consider before switching to Task Queues.
Supplier preference - Google seems to be preferring Cloud Tasks. From the push queues migration guide intro: "Cloud Tasks is now the preferred way of working with App Engine push queues"
Lock in - even if your app is on GAE, moving your queue solution to the GAE bundled one increases your "lock in" to GAE hosting (i.e. it makes it even harder for you to leave GAE if you ever want to change where you run your app, because you'll lose your task queue solution and have to deal with that in addition to dealing with new hosting)
Queues by retry - the GAE Task Queues to Cloud Tasks migration guide section Retrying failed tasks suggests creating a dedicated queue for each set of retry parameters, and then enqueuing tasks accordingly. This might be a suitable way to continue using Cloud Tasks
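The "queues by retry" idea can be sketched as a small lookup from a retry profile to a dedicated queue. The profile names and queue names below are invented for illustration, and each queue is assumed to have been created beforehand with the matching retry settings; only the `projects/.../locations/.../queues/...` path layout comes from Cloud Tasks itself.

```python
# Hypothetical mapping: one queue per set of retry parameters.
RETRY_PROFILE_QUEUES = {
    "no_retry": "tasks-no-retry",      # queue configured with a single attempt
    "default": "tasks-default",        # queue with the default retry settings
    "aggressive": "tasks-aggressive",  # queue with many attempts, short backoff
}


def queue_path_for(project, location, profile):
    """Return the full queue resource name for the given retry profile."""
    queue = RETRY_PROFILE_QUEUES[profile]
    return f"projects/{project}/locations/{location}/queues/{queue}"


print(queue_path_for("my-project", "us-central1", "no_retry"))
```

The returned string would be used as the parent when creating the task, so the choice of queue stands in for the task-level retry settings that are not supported.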

How to expose Hystrix jmx for Prometheus

I'm new to Hystrix and I just created my first Hystrix Commands. The commands are being created and executed in a loop, so the metrics data should have been registered. I am using the servo metrics publisher as follows:
Looking at the JConsole I found the related metrics definition as follows in the link:
I am not using spring, eureka, servo to read data and run the app.
I would like to know how to expose this data in a way that Prometheus can read. I tried hystrix-prometheus, but the documentation does not explain where the metrics are exposed, how to get them, or how to check them.
In order to retrieve Hystrix metrics, you'll first need to get Prometheus' Java Simple Client up and running. The setup depends on your environment. Independent of your environment, the result should be a URL where you can retrieve e.g. simple Java metrics.
Once that is up and running, you can use the line
to register the additional Hystrix metrics. They will be served by the same URL. Please note that you will see Hystrix metrics only after the first call of a Hystrix enabled command.

Long-running script on Google App Engine

I'm attempting to create a microservice on Google App Engine that is not intended to handle HTTP requests.
Instead, I was hoping to have a continuously running Python script that monitors a remote queue--RabbitMQ, to be precise--and sends out an api-call to another service as tasks are pushed to the queue.
I was wondering, firstly, is it possible to run a script upon deployment--one that did not originate with a user action/request?
Secondly, how would I accomplish this?
Thanks in advance for your time!
You can deploy your "script" as a manually scaled module with exactly one instance. As the docs say, "When you start a manual scaling instance, App Engine immediately sends a /_ah/start request to each instance"; so, just set that module's handler for /_ah/start to the handler you want to run (in the module's yaml file and the WSGI app in the Python code, using whatever lightweight framework you like -- webapp2, falcon, flask, bottle, or whatever else... the framework won't be doing much for you in this case save the one-off routing).
Note that the number of free machine hours for manual scaling modules is limited to 8 hours per day (for the smaller, B1 instance class; proportionally fewer for larger instance classes), so you may need to upgrade to paid-app status if you need to run for more than 8 hours.
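A framework-free sketch of that routing, written as a plain WSGI callable, might look like the following. The `make_app` helper and the `run_worker` callable are hypothetical; in a real deployment `run_worker` would contain the loop that consumes the remote queue, and the module's yaml file would still need to route /_ah/start to this app.

```python
def make_app(run_worker):
    """Build a WSGI app whose /_ah/start handler runs the long-lived worker."""

    def app(environ, start_response):
        if environ.get("PATH_INFO") == "/_ah/start":
            # Blocks until the worker returns; assumed to be the
            # queue-monitoring loop in a real service.
            run_worker()
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [b"worker finished"]
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]

    return app


# Example wiring with a stand-in worker.
app = make_app(lambda: None)
```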
Like @brant said, App Engine is designed to handle HTTP requests. It's not a perfect fit for background jobs, unless you try to wrap your logic into one HTTP request.
Further, App Engine will emit an error when the response times out, depending on your scaling settings. If you want to try it, consider basic or manual scaling.
For this type of workload, I would suggest you use a VM.
I think there are a few problems with this design.
First, App Engine is designed to be an HTTP request processor, not a RabbitMQ message processor. GAE is intended for many small requests, not one long-running process.
Second, "RabbitMQ should not be exposed to the public internet, it wasn't created for such use case."
I would recommend that you keep the RabbitMQ clients on the same internal network as the RabbitMQ broker, and have the clients send HTTP requests to App Engine.

How to set deadline for BigQuery on Google App Engine

I have a Google App Engine program that calls BigQuery for data.
The query usually takes 3 - 4.5 seconds and is fine but sometimes takes over five seconds and throws this error:
DeadlineExceededError: The API call urlfetch.Fetch() took too long to respond and was cancelled.
This article shows the deadlines and the different kinds of deadline errors.
Is there a way to set the deadline for a BigQuery job to be above 5 seconds? Could not find it in the BigQuery API docs.
BigQuery queries are fast, but often take longer than the default App Engine urlfetch timeout. The BigQuery API is async, so you need to break up the steps into API calls that each are shorter than 5 seconds.
For this situation, I would use the App Engine Task Queue:
Make a call to the BigQuery API to insert your job. This returns a JobID.
Place a task on the App Engine task queue to check out the status of the BigQuery query job at that ID.
If the BigQuery Job Status is not "DONE", place a new task on the queue to check it again.
If the Status is "DONE," then make a call using urlfetch to retrieve the results.
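The steps above can be sketched as a single task handler that either fetches the results or re-enqueues itself. All four names here are hypothetical stand-ins: `get_job_state` and `fetch_results` would wrap the BigQuery API calls, and `enqueue_check` would add a new task to the App Engine task queue.

```python
def check_job(job_id, get_job_state, enqueue_check, fetch_results):
    """One task-queue invocation: return results if DONE, else re-enqueue."""
    state = get_job_state(job_id)
    if state == "DONE":
        # Job finished: a short call retrieves the results.
        return fetch_results(job_id)
    # Not finished yet: schedule another check and return quickly,
    # so no single request has to wait for the query to complete.
    enqueue_check(job_id)
    return None
```

Each invocation makes only short API calls, so none of them needs to stay open for the full duration of the query.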
Note I would go with Michael's suggestion since that is the most robust. I just wanted to point out that you can increase the urlfetch timeout up to 60 seconds, which should be enough time for most queries to complete.
How to set timeout for urlfetch in Google App Engine?
I was unable to get the urlfetch.set_default_fetch_deadline() method to apply to the Big Query API, but was able to increase the timeout when authorizing the big query session as follows:
from apiclient.discovery import build
from httplib2 import Http
from oauth2client.service_account import ServiceAccountCredentials
credentials = ServiceAccountCredentials.from_json_keyfile_dict(credentials_dict, scopes)
# Create an authorized session and set the url fetch timeout.
http_auth = credentials.authorize(Http(timeout=60))
# Build the service.
service = build(service_name, version, http=http_auth)
# Make the query (project_id and query_body are placeholder names)
request = service.jobs().query(projectId=project_id, body=query_body)
query_response = request.execute()
Or with an asynchronous approach using jobs().insert
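# project_id and job_body are placeholder names
query_response = service.jobs().insert(projectId=project_id, body=job_body).execute()
big_query_job_id = query_response['jobReference']['jobId']
# poll the job.get endpoint until the job is complete
while True:
    job_status_response = \
        service.jobs().get(projectId=project_id, jobId=big_query_job_id).execute()
    if job_status_response['status']['state'] == 'DONE':
        results_response = \
            service.jobs().getQueryResults(projectId=project_id, jobId=big_query_job_id).execute()
        break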
We ended up going with an approach similar to what Michael suggests above, however even when using the asynchronous call, the getQueryResults method (paginated with a small maxResults parameter) was timing out on url fetch, throwing the error posted in the question.
So, in order to increase the timeout of URL Fetch in Big Query / App Engine, set the timeout accordingly when authorizing your session.
To issue HTTP requests in AppEngine you can use urllib, urllib2, httplib, or urlfetch. However, no matter what library you choose, AppEngine will perform HTTP requests using App Engine's URL Fetch service.
The googleapiclient uses httplib2. It looks like httplib2.Http passes its timeout to urlfetch. Since it has a default value of None, urlfetch sets the deadline of that request to 5s no matter what you set with urlfetch.set_default_fetch_deadline.
Under the covers httplib2 uses the socket library for HTTP requests.
To set the timeout you can do the following:
import socket
socket.setdefaulttimeout(30)  # seconds; the value is an example
You should also be able to do this but I haven't tested it:
http = httplib2.Http(timeout=30)
If you don't have existing code to time the request you can wrap your query like so:
import time
start_query = time.time()
<your query code>
end_query = time.time()
print(end_query - start_query)
This is one way to solve bigquery timeouts in AppEngine for Go. Simply set TimeoutMs on your queries to well below 5000. The default timeout for bigquery queries is 10000ms which is over the default 5 second deadline for outgoing requests in AppEngine.
The gotcha is that the timeout must be set both in the initial request, bigquery.service.Jobs.Query(…), and in the subsequent bigquery.service.Jobs.GetQueryResults(…) which you use to poll the query results.
query := &gbigquery.QueryRequest{
    DefaultDataset: &gbigquery.DatasetReference{
        DatasetId: "mydatasetid",
        ProjectId: "myprojectid",
    },
    Kind:      "json",
    Query:     "<insert query here>",
    TimeoutMs: 3000, // <- important!
}
queryResponse, err := bigquery.service.Jobs.Query("myprojectid", query).Do()
// handle err, then determine if queryResponse is a completed job and if not start to poll
queryResponseResults, err := bigquery.service.Jobs.
    GetQueryResults("myprojectid", queryResponse.JobReference.JobId).
    TimeoutMs(DefaultTimeoutMS). // <- important!
    Do()
// handle err, then determine if queryResponseResults is a completed job and if not continue to poll
The nice thing about this is that you maintain the default request deadline for the overall request (60s for normal requests and 10min for tasks and cronjobs) while avoiding setting the deadline for outgoing requests to some arbitrary large value.
