how to configure default bucket in sagemaker pipeline? - amazon-sagemaker

I am doing some experimentation in training models in amazon sagemaker/studio via sagemaker piplines. based on samples/aws docs , in my notebook ( sample code below) , in my jupyter notebook, i set up sagemaker session , execution role via code like shown below . also, there is a default bucket that is set up => sess.default_bucket(),
how is this default bucket configured? or can we simply point to the one we want?
import sagemaker
import os
try:
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
except ValueError:
import boto3
# creates a boto3 session using the local profile we defined
if local_profile_name:
os.environ['AWS_PROFILE'] = local_profile_name # setting env var bc local-mode cannot use boto3 session
#bt3 = boto3.session.Session(profile_name=local_profile_name)
#iam = bt3.client('iam')
# create sagemaker session with boto3 session
#sess = sagemaker.Session(boto_session=bt3)
iam = boto3.client('iam')
sess = sagemaker.Session()
# get role arn
role = iam.get_role(RoleName=role_name)['Role']['Arn']
print(sess.default_bucket()) # s3 bucketname

Within the pipeline context, PipelineSession should be used wherever possible.
This class inherits the SageMaker session, it provides convenient
methods for manipulating entities and resources that Amazon SageMaker
uses, such as training jobs, endpoints, and input datasets in S3. When
composing SageMaker Model-Building Pipeline, PipelineSession is
recommended over regular SageMaker Session
At this point, you can specify the bucket directly as a parameter:
from sagemaker.workflow.pipeline_context import PipelineSession
pipeline_session = PipelineSession(default_bucket = your_default_bucket)

Related

How to deploy the hugging face model via sagemaker pipeline

Below is the code to get the model from Hugging Face Hub and deploy the same model via sagemaker.
from sagemaker.huggingface import HuggingFaceModel
import sagemaker
role = sagemaker.get_execution_role()
# Hub Model configuration. https://huggingface.co/models
hub = {
'HF_MODEL_ID':'siebert/sentiment-roberta-large-english',
'HF_TASK':'text-classification'
}
# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
transformers_version='4.17.0',
pytorch_version='1.10.2',
py_version='py38',
env=hub,
role=role,
)
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
initial_instance_count=1, # number of instances
instance_type='ml.g4dn.xlarge' # ec2 instance type
)
How can I deploy this model via sagemaker pipeline .
How can I include this code in sagemaker pipeline.
Prerequisites
SageMaker pipelines offer many different components with lots of functionality. Your question is quite general and needs to be contextualized to a specific problem.
You have to start with putting up a pipeline first.
See the complete guide "Defining a pipeline".
Quick response
My answer is to follow this official AWS guide that answers your question exactly:
SageMaker Pipelines: train a Hugging Face model, deploy it with a Lambda step
General explanation
Basically, you need to build your pipeline architecture with the components you need and register the trained model within the Model Registry.
Next, you have two paths you can follow:
Trigger a lambda that automatically deploys the registered model (as the guide does).
Out of the context of the pipeline, do the automatic deployment by retrieving the ARN of the registered model on the Model Registry. You can get it from register_step.properties.ModelPackageArn or in an external script using boto3 (e.g. using list_model_packages)

Total cpu utilization of running app engine flexible instances

I need to make decisions in an external system based on the current CPU utilization of my App Engine Flexible service. I can see the exact values / metrics I need to use in the dashboard charting in my Google Cloud Console, but I don't see a direct, easy way to get this information from something like a gcloud command.
I also need to know the count of running instances, but I think I can use gcloud app instances list -s default to get a list of my running instances in the default service, and then I can use a count of lines approach to get this info easily. I intend to make a python function which returns a tuple like (instance_count, cpu_utilization).
I'd appreciate if anyone can direct me to an easy way to get this. I am currently exploring the StackDriver Monitoring service to get this same information, but as of now it is looking super-complicated to me.
You can use the gcloud app instances list -s default command to get the running instances list, as you said. To retrieve CPU utilization, have a look on this Python Client for Stackdriver Monitoring. To list available metric types:
from google.cloud import monitoring
client = monitoring.Client()
for descriptor in client.list_metric_descriptors():
print(descriptor.type)
Metric descriptors are described here. To display utilization across your GCE instances during the last five minutes:
metric = 'compute.googleapis.com/instance/cpu/utilization'
query = client.query(metric, minutes=5)
print(query.as_dataframe())
Do not forget to add google-cloud-monitoring==0.28.1 to “requirements.txt” before installing it.
Check this code that locally runs for me:
import logging
from flask import Flask
from google.cloud import monitoring as mon
app = Flask(__name__)
#app.route('/')
def list_metric_descriptors():
"""Return all metric descriptors"""
# Instantiate client
client = mon.Client()
for descriptor in client.list_metric_descriptors():
print(descriptor.type)
return descriptor.type
#app.route('/CPU')
def cpuUtilization():
"""Return CPU utilization"""
client = mon.Client()
metric = 'compute.googleapis.com/instance/cpu/utilization'
query = client.query(metric, minutes=5)
print(type(query.as_dataframe()))
print(query.as_dataframe())
data=str(query.as_dataframe())
return data
#app.errorhandler(500)
def server_error(e):
logging.exception('An error occurred during a request.')
return """
An internal error occurred: <pre>{}</pre>
See logs for full stacktrace.
""".format(e), 500
if __name__ == '__main__':
# This is used when running locally. Gunicorn is used to run the
# application on Google App Engine. See entrypoint in app.yaml.
app.run(host='127.0.0.1', port=8080, debug=True)

AppEngine datastore - backup programmatically

I would like to backup my app's datastore programmatically, on a regular basis.
It seems possible to create a cron that backs up the datastore, according to https://developers.google.com/appengine/articles/scheduled_backups
However, I require a more fine-grained solution: Create different backup files for dynamically changing namespaces.
Is it possible to simply call the /_ah/datastore_admin/backup.create url with GET/POST?
Yes; I'm doing exactly that in order to implement some logic that couldn't be done with cron.
Use the taskqueue API to add the URL request, like this:
from google.appengine.api import taskqueue
taskqueue.add(url='/_ah/datastore_admin/backup.create',
method='GET',
target='ah-builtin-python-bundle',
params={'kind': ('MyKind1', 'MyKind2')})
If you want to use more parameters that would otherwise go into the cron url, like 'filesystem', put those in the params dict alongside 'kind'.
Programmatically backup datastore based on environment
This comes in addition to Jamie's answer. I needed to backup the datastore to Cloud Storage, based on the environment (staging/production). Unfortunately, this can no longer be achieved via a cronjob so I needed to do it programmatically and create a cron to my script. I can confirm that what's below is working, as I saw there were some people complaining that they get a 404. However, it's only working on a live environment, not on the local development server.
from datetime import datetime
from flask.views import MethodView
from google.appengine.api import taskqueue
from google.appengine.api.app_identity import app_identity
class BackupDatastoreView(MethodView):
BUCKETS = {
'app-id-staging': 'datastore-backup-staging',
'app-id-production': 'datastore-backup-production'
}
def get(self):
environment = app_identity.get_application_id()
task = taskqueue.add(
url='/_ah/datastore_admin/backup.create',
method='GET',
target='ah-builtin-python-bundle',
queue_name='backup',
params={
'filesystem': 'gs',
'gs_bucket_name': self.get_bucket_name(environment),
'kind': (
'Kind1',
'Kind2',
'Kind3'
)
}
)
if task:
return 'Started backing up %s' % environment
def get_bucket_name(self, environment):
return "{bucket}/{date}".format(
bucket=self.BUCKETS.get(environment, 'datastore-backup'),
date=datetime.now().strftime("%d-%m-%Y %H:%M")
)
You can now use the managed export and import feature, which can be accessed through gcloud or the Datastore Admin API:
Exporting and Importing Entities
Scheduling an Export

Unable to connect to Google Cloud SQL from development server using SQLAlchemy

I've been unsuccessful at connecting to Google Cloud SQL using SQLAlchemy 0.7.9 from my development workstation (in hopes to generate the schema using create_all()). I can't get passed the following error:
sqlalchemy.exc.DBAPIError: (AssertionError) No api proxy found for service "rdbms" None None
I was able to successfully connect to the database instance using the google_sql.py instancename, which initially opened up a browser to authorize the connection (which now appears to have cached the authorization, although I don't have the ~/Library/Preferences/com.google.cloud.plist file as https://developers.google.com/cloud-sql/docs/commandline indicates I should)
Here is the simple application I'm using to test the connection:
from sqlalchemy import create_engine
engine = create_engine('mysql+gaerdbms:///myapp', connect_args={"instance":"test"})
connection = engine.connect()
The full stacktrace is available here - https://gist.github.com/4486641
It turns out the mysql+gaerdbms:/// driver in SQLAlchemy was only setup to use the rdbms_apiproxy DBAPI, which can only be used when accessing Google Cloud SQL from a Google App Engine instance. I submitted a ticket to SQLAlchemy to update the driver to use the OAuth-based rdbms_googleapi when not on Google App Engine, just like the Django driver provided in the App Engine SDK. rdbms_googleapi is also the DBAPI that google_sql.py uses (remote sql console).
The updated dialect is expected to be part of 0.7.10 and 0.8.0 releases, but until they are available, you can do the following:
1 - Copy the updated dialect in the ticket to a file (ex. gaerdbms_dialect.py)
from sqlalchemy.dialects.mysql.mysqldb import MySQLDialect_mysqldb
from sqlalchemy.pool import NullPool
import re
"""Support for Google Cloud SQL on Google App Engine
Connecting
-----------
Connect string format::
mysql+gaerdbms:///<dbname>?instance=<project:instance>
# Example:
create_engine('mysql+gaerdbms:///mydb?instance=myproject:instance1')
"""
class MySQLDialect_gaerdbms(MySQLDialect_mysqldb):
#classmethod
def dbapi(cls):
from google.appengine.api import apiproxy_stub_map
if apiproxy_stub_map.apiproxy.GetStub('rdbms'):
from google.storage.speckle.python.api import rdbms_apiproxy
return rdbms_apiproxy
else:
from google.storage.speckle.python.api import rdbms_googleapi
return rdbms_googleapi
#classmethod
def get_pool_class(cls, url):
# Cloud SQL connections die at any moment
return NullPool
def create_connect_args(self, url):
opts = url.translate_connect_args()
opts['dsn'] = '' # unused but required to pass to rdbms.connect()
opts['instance'] = url.query['instance']
return [], opts
def _extract_error_code(self, exception):
match = re.compile(r"^(\d+):").match(str(exception))
code = match.group(1)
if code:
return int(code)
dialect = MySQLDialect_gaerdbms
2 - Register the custom dialect (you can override the existing schema)
from sqlalchemy.dialects import registry
registry.register("mysql.gaerdbms", "application.database.gaerdbms_dialect", "MySQLDialect_gaerdbms")
Note: 0.8 allows a dialect to be registered within the current process (as shown above). If you are running an older version of SQLAlchemy, I recommend upgrading to 0.8+ or you'll need to create a separate install for the dialect as outlined here.
3 - Update your create_engine('...') url as the project and instance are now provided as part of the query string of the URL
mysql+gaerdbms:///<dbname>?instance=<project:instance>
for example:
create_engine('mysql+gaerdbms:///mydb?instance=myproject:instance1')
I think I have a working sample script as follows. Can you try a similar thing and let me know how it goes?
However(and you might be already aware of it), if your goal is to create the schema on Google Cloud SQL instance, maybe you can create the schema against a local mysql server, dump it with mysqldump, and then import the schema to Google Cloud SQL. Using local mysql server is very convenient for development anyway.
from sqlalchemy import create_engine
from google.storage.speckle.python.api import rdbms as dbi_driver
from google.storage.speckle.python.api import rdbms_googleapi
INSTANCE = 'YOURPROJECT:YOURINSTANCE'
DATABASE = 'YOURDATABASE'
def get_connection():
return rdbms_googleapi.connect('', instance=INSTANCE, database=DATABASE)
def main():
engine = create_engine(
'mysql:///YOURPROJECT:YOURINSTANCE/YOURDATABASE',
module=dbi_driver,
creator=get_connection,
)
print engine.execute('SELECT 1').scalar()
if __name__ == '__main__':
main()

What's a namespace used for in the App Engine datastore?

In the development admin console, when I look at my data, it says "Select different namespace".
What are namespaces for and how should I use them?
Namespaces allow you to implement segregation of data for multi-tenant applications. The official documentation links to some sample projects to give you an idea how it might be used.
Namespaces is used in google app engine to create Multitenant Applications. In Multitenent applications single instance of the application runs on a server, serving multiple client organizations (tenants). With this, an application can be designed to virtually partition its data and configuration (business logic), and each client organization works with a customized virtual application instance..you can easily partition data across tenants simply by specifying a unique namespace string for each tenant.
Other Uses of namespace:
Compartmentalizing user information
Separating admin data from application data
Creating separate datastore instances for testing and production
Running multiple apps on a single app engine instance
For More information visit the below links:
http://www.javacodegeeks.com/2011/12/multitenancy-in-google-appengine-gae.html
https://developers.google.com/appengine/docs/java/multitenancy/
http://java.dzone.com/articles/multitenancy-google-appengine
http://www.sitepoint.com/multitenancy-and-google-app-engine-gae-java/
Looking, towards this question is not that much good reviewed and answered so trying to give this one.
When using namespaces, we can have a best practice of key and value separation there on a given namespace. Following is the best example of giving the namespace information thoroughly.
from google.appengine.api import namespace_manager
from google.appengine.ext import db
from google.appengine.ext import webapp
class Counter(db.Model):
"""Model for containing a count."""
count = db.IntegerProperty()
def update_counter(name):
"""Increment the named counter by 1."""
def _update_counter(name):
counter = Counter.get_by_key_name(name)
if counter is None:
counter = Counter(key_name=name);
counter.count = 1
else:
counter.count = counter.count + 1
counter.put()
# Update counter in a transaction.
db.run_in_transaction(_update_counter, name)
class SomeRequest(webapp.RequestHandler):
"""Perform synchronous requests to update counter."""
def get(self):
update_counter('SomeRequest')
# try/finally pattern to temporarily set the namespace.
# Save the current namespace.
namespace = namespace_manager.get_namespace()
try:
namespace_manager.set_namespace('-global-')
update_counter('SomeRequest')
finally:
# Restore the saved namespace.
namespace_manager.set_namespace(namespace)
self.response.out.write('<html><body><p>Updated counters')
self.response.out.write('</p></body></html>')

Resources