SageMaker memory leak - amazon-sagemaker

I deployed a deep learning model on SageMaker and created an endpoint.
Unfortunately, when I sent it a large image, the endpoint returned 'RuntimeError: CUDA error: out of memory'.
I would like to re-launch the endpoint, but there doesn't seem to be a restart button.
What can I do to restart it?
Thank you

Assuming that by "restart" you mean UpdateEndpoint: you will not be able to update a SageMaker endpoint that is already in 'Failed' status. This is documented in the SageMaker API reference.
If you have already identified the cause of the endpoint failure, you can delete the failed endpoint and create a new one with the correct model.
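For example, here is a minimal sketch with boto3 (the endpoint and config names are illustrative, not from the question): delete the failed endpoint, wait for the deletion to complete, and recreate it from an endpoint config that points at the corrected model.

```python
import boto3

sm = boto3.client("sagemaker")

# Names below are illustrative; substitute your own.
# Deleting the endpoint does not delete the endpoint config or model.
sm.delete_endpoint(EndpointName="my-endpoint")

# Wait until the deletion finishes before reusing the same name.
sm.get_waiter("endpoint_deleted").wait(EndpointName="my-endpoint")

# Recreate the endpoint from the existing (or a corrected) endpoint config.
sm.create_endpoint(
    EndpointName="my-endpoint",
    EndpointConfigName="my-endpoint-config",
)
```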

Related

How do you resolve an "Access Denied" error when invoking `image_uris.retrieve()` in AWS SageMaker JumpStart?

I am working in a SageMaker environment that is locked down. For example, my user account is prevented from creating S3 buckets. But I can successfully run vanilla ML training jobs by passing role=get_execution_role() to an instance of the Estimator class when using an out-of-the-box algorithm such as XGBoost.
Now I'm trying to use an algorithm (LightGBM) that is only available via the JumpStart feature in SageMaker, but I can't get it to work. When I try to retrieve an image URI via image_uris.retrieve(), it returns the following error:
ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied.
This makes some sense to me if my user permissions are being used when retrieving an object. But what I want to do is specify another role, like the one returned from get_execution_role(), to perform these tasks.
Is that possible? Is there another workaround available? How can I see which role is being used?
Thanks,
When I encountered this issue, it was a permissions issue with a bucket whose access had changed.
In the SageMaker Python SDK source code, there is a cache located in an AWS-owned bucket, jumpstart-cache-prod-{region}, along with a manifest.json that resolves the ECR path for the image for you.
If you look at the stack trace, it is likely erroring out in the code that looks up that manifest.
One place to look is whether new restrictions have been placed in IAM. Included below is the minimum policy you need to access JumpStart (pretrained) models.
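The following is a sketch of that policy, assuming the jumpstart-cache-prod-{region} bucket naming mentioned above; it is reconstructed rather than copied from AWS's documentation, so check the JumpStart docs for the exact statement:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::jumpstart-cache-prod-*",
        "arn:aws:s3:::jumpstart-cache-prod-*/*"
      ]
    }
  ]
}
```

The key point is read access to the AWS-owned cache bucket; whichever identity actually makes the GetObject call (your user or the execution role) is the one that needs it.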

Creating a Message-Hub Bridge for IBM Cloud Object Storage

I'm trying to create a "Bridge" from Message Hub to S3 Object Storage, copying information from the credentials that I created, but I always get an error that says "Please try refreshing the page, or logging back into Bluemix."
I have already created an access policy for these credentials and the Bucket I want to use as destination.
Also tried with private and public end-points.
I wasn't able to find documentation that explains how to accomplish this. Nothing seems to work.
Thanks!
Apologies, this is an internal error caused by the S3 Object Storage bridge capability being made available in the UI but not yet in the backend.
An update to the Message Hub service will be made this week to correct this.

Getting 403 not authorized when indexing documents on Retrieve and Rank

I am suddenly getting a 403 error when I try to POST an update to the Retrieve and Rank service. This code is under development, but it had been working up until yesterday. The failure occurs only when doing a POST to /v1/solr_clusters/{solr_cluster_id}/solr/{collection_name}/update, and it fails the same way whether I do it via my program, the Swagger API documentation, or cURL. All other operations to this service that I've tried work fine using the same credentials that I'm using with this POST. The error message I'm getting back is
Error: WRRCSH004: Service [1d111267-76b7-417a-98bd-4e9a58072ef9] is not authorized for cluster [sc262b05e8_dcf5_40b4_b662_ae85058ff07f]!. I don't know where the identifier (1d111267-76b7-417a-98bd-4e9a58072ef9) is coming from; that's not the user ID I'm sending in.
Looking into your issue, it appears your Bluemix organization has multiple service instances. The 403 you were seeing is because you're trying to access a Solr cluster using credentials from one of your instances against a cluster in the other instance. The ID 1d111267-76b7-417a-98bd-4e9a58072ef9 represents one of these service instances; the issue is that the cluster you're trying to access is not part of that instance. A good way to test this is to use the same credentials that generate the 403 but simply list the Solr clusters you have created by doing a GET against https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/.
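As a quick sketch of that check in Python (substitute the username and password from the service-instance credentials): if the failing cluster ID is missing from the response, those credentials belong to a different instance.

```python
import requests

resp = requests.get(
    "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/",
    auth=("YOUR_USERNAME", "YOUR_PASSWORD"),  # service credentials, not your Bluemix login
)
print(resp.status_code)
print(resp.json())  # the cluster you're POSTing to should appear in this list
```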
As for the 500 issue, I wasn't able to see anything on our end. If you're still experiencing that I would suggest posting another question and we can look into things again.
Thanks,
-Scott

How to set CacheControl in the new Cloud Storage Api (Python)?

I'm following the guidelines and updating my code to use the new Cloud Storage API in GAE. I need to set the Cache-Control headers; previously it was easy:
files.gs.create(filename, mime_type='image/png', acl='public-read', cache_control='public, max-age=100000, must-revalidate' )
But with the new API, the guidelines say that cache_control is not available...
I get this error when I tried to put the cache control inside the options:
ValueError: option cache_control is not supported.
I tried with Cache-Control and got the same error...
As usual, the documentation of the new API is not good.
Can someone show me how to set the cache headers in the new Cloud Storage API using Python? If that's not possible, can I still use the old API for my project?
Thanks.
You are right. As documented here, the open function only supports x-goog-acl and x-goog-meta headers.
Cache control is likely to be added in the near future to make migration easier. Please note that the main value of the GCS client lib is buffered reads, buffered resumable writes, and automatic retries to overcome transient errors. Many other simple REST operations on GCS (e.g. setting cache headers, copying files, creating buckets...) can already be done with the Google API Client. The "downside" of the Google API Client is that, since it doesn't come directly from/for App Engine, it does not have dev appserver support.
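To illustrate what the client library does accept today, here is a minimal sketch (the bucket path and payload are made up) using only the supported x-goog-acl and x-goog-meta options:

```python
import cloudstorage

png_bytes = b'...'  # your image data

# open() accepts only x-goog-acl and x-goog-meta-* options;
# passing cache_control here raises the ValueError quoted above.
with cloudstorage.open(
        '/my-bucket/images/logo.png', 'w',
        content_type='image/png',
        options={
            'x-goog-acl': 'public-read',
            'x-goog-meta-owner': 'my-app',  # arbitrary custom metadata
        }) as f:
    f.write(png_bytes)
```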

Google App Engine RPC Service

I've created an application using Google Web Toolkit and Google App Engine that saves objects based on user selections into an RPC service implementation.
It was my understanding that every time GWT "creates" this service, the data is reinstantiated with the default values. Unfortunately, it seems that when a user on one computer saves a change to the data, a user on another computer sees the data change on their end. I'm not using a datastore or anything, so why is this happening?
EDIT: After some research, I am seeing that I need to use sessions to handle the delivery of the objects. However, in my RemoteServiceServlet I am calling this.getThreadLocalRequest() and it's returning null. Why does this.getThreadLocalRequest() return null?
UPDATE: Answering my own question here :) You cannot call getThreadLocalRequest() in the constructor of your servlet. duh.
The problem was that I was calling getThreadLocalRequest() in the constructor of my servlet. Good thing to be aware of: the thread-local request isn't available until the servlet is actually servicing a request.
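A minimal Java sketch of the takeaway (class and method names are made up, and in a real project the class would also implement the GWT RPC service interface declared on the client side): move any use of getThreadLocalRequest() out of the constructor and into the RPC methods, and keep per-user data in the session rather than in servlet fields, since a single servlet instance is shared by all users.

```java
import javax.servlet.http.HttpSession;
import com.google.gwt.user.server.rpc.RemoteServiceServlet;

public class SelectionServiceImpl extends RemoteServiceServlet {

    // Don't call getThreadLocalRequest() here: the constructor runs
    // before any request is being serviced, so it would return null.
    // Fields set here are also shared by ALL users, which explains
    // the cross-user data bleed described above.
    public SelectionServiceImpl() {
    }

    public void saveSelection(String value) {
        // Inside an RPC method a request is in flight, so this is non-null.
        HttpSession session = getThreadLocalRequest().getSession();
        session.setAttribute("selection", value);  // per-user, not shared
    }

    public String loadSelection() {
        Object v = getThreadLocalRequest().getSession().getAttribute("selection");
        return v == null ? null : v.toString();
    }
}
```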
