How to run a batch transform job in a SageMaker Pipeline via custom inference code? - amazon-sagemaker

Based on the AWS documentation/example provided here, https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-pipelines/tabular/abalone_build_train_deploy/sagemaker-pipelines-preprocess-train-evaluate-batch-transform.html#Define-a-Transform-Step-to-Perform-Batch-Transformation, a model is created and a batch transform inference can be run on the trained model. It works for this example, but if we need a custom inference script, how do we include that script in the model or model package before we run the batch transformation?
from sagemaker.transformer import Transformer
from sagemaker.inputs import TransformInput
from sagemaker.workflow.steps import TransformStep

transformer = Transformer(
    model_name=step_create_model.properties.ModelName,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    output_path=f"s3://{default_bucket}/AbaloneTransform",
)

step_transform = TransformStep(
    name="AbaloneTransform", transformer=transformer, inputs=TransformInput(data=batch_data)
)

You need a "model repacking step".
From the Amazon SageMaker Workflows FAQ:
Model repacking happens when the pipeline needs to include a custom
script in the compressed model file (model.tar.gz) to be uploaded to
Amazon S3 and used to deploy a model to a SageMaker endpoint. When
SageMaker pipeline trains a model and registers it to the model
registry, it introduces a repack step if the trained model output from
the training job needs to include a custom inference script. The
repack step uncompresses the model, adds a new script, and
recompresses the model. Running the pipeline adds the repack step as a
training job.
Basically, you can redefine a SageMaker Model by passing the training output as model_data and the inference script as entry_point.
Then, sequentially after training the model, redefine the model with the new entry_point, and use the transformer on that redefined model.
This is an example flow taken from tested code:
from sagemaker.model import Model
from sagemaker.transformer import Transformer
from sagemaker.workflow.model_step import ModelStep

my_model = Model(
    image_uri=your_img_uri,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    role=role,
    entry_point='inference_script.py',
    name="your_inference_step_name"
)

step_create_model = ModelStep(
    name="YourInfName",
    step_args=my_model.create(instance_type="ml.m5.xlarge")
)

transformer = Transformer(
    model_name=step_create_model.properties.ModelName,
    instance_count=your_instance_count,
    instance_type=your_instance_type,
    output_path=your_path
)
Of course, instead of the generic Model, you can directly use the one that best suits your requirements (e.g. PyTorchModel, etc.).
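To close the loop in the pipeline, the transformer can then be wired into a TransformStep exactly as in the question's own snippet. A minimal sketch, assuming batch_data is your pipeline parameter or S3 URI for the input data and the step name is a placeholder:

from sagemaker.inputs import TransformInput
from sagemaker.workflow.steps import TransformStep

# The transform runs against the repacked model created above.
step_transform = TransformStep(
    name="YourTransformName",
    transformer=transformer,
    inputs=TransformInput(data=batch_data),
)

Because the transformer references step_create_model.properties.ModelName, SageMaker Pipelines infers the dependency and runs the transform only after the repacked model has been created.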

Related

Sagemaker Studio trial component chart not showing

I am wondering why I am unable to see the loss and accuracy curves in the SageMaker Studio Trial components chart.
I am using tensorflow's keras API for training.
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="sm_entrypoint.sh",
    source_dir=".",
    role=role,
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    framework_version="2.4",
    py_version="py37",
    metric_definitions=[
        {'Name': 'train:loss', 'Regex': 'loss: ([0-9.]+)'},
        {'Name': 'val:loss', 'Regex': 'val_loss: ([0-9.]+)'},
        {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9.]+)'},
        {'Name': 'val:accuracy', 'Regex': 'val_accuracy: ([0-9.]+)'}
    ],
    enable_sagemaker_metrics=True
)
estimator.fit(
    inputs="s3://xxx",
    experiment_config={
        "ExperimentName": "urbansounds-20211027",
        "TrialName": "tf-classical-NN-20211027",
        "TrialComponentDisplayName": "Train"
    }
)
The regex appears to be parsing the metrics correctly: under the metrics tab, each metric shows 12 data points, corresponding to the 12-epoch cycle I specified.
However, the chart is empty. The x-axis is in time here, but it is also empty when I switch to epoch.
tl;dr: in your entry_point source code (sm_entrypoint.sh), you need to explicitly tell the experiment tracker which epoch each metric value is associated with, using the log_metric() function.
There are two ways a tracker can work in SageMaker Experiment Tracking: (1) you log the metric in your entry_point code and use the metric_definitions argument in the estimator to teach SageMaker to parse the metric from the logs, as you did, or (2) you explicitly create a Tracker instance inside your entry_point and invoke its log_metric() method. Apparently only method (2) tells the tracker which epoch each metric entry is registered to.
I found the answer in a random video, https://youtu.be/gMnkfPztIHU?t=141, after days of searching :(
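For reference, a minimal sketch of method (2) inside the training script, assuming the sagemaker-experiments package is installed (see the catch below); num_epochs, model, train_ds and val_ds are placeholders from a typical Keras training loop:

from smexperiments.tracker import Tracker

# Load the tracker for the trial component SageMaker created for this training job.
with Tracker.load() as tracker:
    for epoch in range(num_epochs):
        history = model.fit(train_ds, validation_data=val_ds, epochs=1, verbose=2)
        # iteration_number is what ties the value to an epoch, so Studio can plot it.
        tracker.log_metric(metric_name="val:loss",
                           value=history.history["val_loss"][0],
                           iteration_number=epoch)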
There is also a catch: SageMaker images do not have the Tracker package installed, so if you just have from smexperiments.tracker import Tracker in your entry_point source code, your SageMaker Estimator will complain. You need to install sagemaker-experiments for your image:
- create a requirements.txt file that has sagemaker-experiments==0.1.35 in it;
- specify the source dir by including source_dir="./dir_that_contains_requirements.txt" in your estimator creation.

Report to find objects of a segw project and put them on a request

I'm new to SAP, and during my practice I came across a 'problem': when transporting my project to another system I had to manually include some objects that were in other requests.
So now I'm trying to write a report that joins all the objects related to a SEGW project in a single request. My idea is to pass the project ID or name to my report, find the objects, create a request and put all of them into it.
I've already found something. When you create a SEGW project and generate it, the request contains:
- Class (ABAP objects) with the DPC and MPC
- SAP Gateway Business Suite Enablement - Model
- SAP Gateway BSE - Service Builder Project
- SAP Gateway Business Suite Enablement - Service
I've found two tables that help me get the DPC and MPC objects: TMDIR and VSEOCLASS.
Am I on the right path? Is there a way to find all objects related to the project, or will I need to find them separately like the DPC and MPC I've already found?
Thanks!
Assuming all your SEGW objects reside in a single package, which is usually the case when you create a SEGW project from scratch:
DATA: l_trkorr  TYPE trkorr,
      l_package TYPE devclass VALUE 'ZSEGW_P'.

cl_pak_package_queries=>get_all_subpackages( EXPORTING im_package     = l_package
                                             IMPORTING et_subpackages = DATA(lt_descendant) ).

INSERT VALUE cl_pak_package_queries=>ty_subpackage_info( package = l_package ) INTO TABLE lt_descendant.

SELECT pgmid, object, obj_name FROM tadir
  FOR ALL ENTRIES IN @lt_descendant
  WHERE devclass = @lt_descendant-package
  INTO TABLE @DATA(lt_segw_objects).

DATA(instance) = cl_adt_cts_management=>create_instance( ).

LOOP AT lt_segw_objects ASSIGNING FIELD-SYMBOL(<fs_obj>).
  TRY.
      instance->insert_objects_in_wb_request( EXPORTING pgmid    = <fs_obj>-pgmid
                                                        object   = <fs_obj>-object
                                                        obj_name = CONV trobj_name( <fs_obj>-obj_name )
                                              IMPORTING result   = DATA(result)
                                                        request  = DATA(request)
                                              CHANGING  trkorr   = l_trkorr ).
    CATCH cx_adt_cts_insert_error.
  ENDTRY.
ENDLOOP.
This snippet creates a transport request with number l_trkorr and, as long as that variable is not changed, puts all remaining objects in the same request.
WARNING: this will not work if the objects are locked (in another request); you will get a cx_adt_cts_insert_error exception. There is no way to unlock objects programmatically, only via the SE03 tool.

How do we get the document file url using the Watson Discovery Service?

I don't see a solution to this in the available API documentation.
It is also not available in the web console.
Is it possible to get the file URL using the Watson Discovery Service?
If you need to store the original source/file URL, you can include it as a field within your documents in the Discovery service, then you will be able to query that field back out when needed.
I also struggled with this but ultimately got it working using the Python bindings for Watson Discovery. The online documentation and API reference are very poor; here's what I used to get it working.
(Assume you have a Watson Discovery service and a created collection.)
# Programmatic upload and retrieval of documents and metadata with Watson Discovery
from watson_developer_cloud import DiscoveryV1
import os
import json

discovery = DiscoveryV1(
    version='2017-11-07',
    iam_apikey='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    url='https://gateway-syd.watsonplatform.net/discovery/api'
)

environments = discovery.list_environments().get_result()
print(json.dumps(environments, indent=2))
This gives you your environment ID. Now append to your code:
collections = discovery.list_collections('{environment-id}').get_result()
print(json.dumps(collections, indent=2))
This will show you the collection ID for uploading documents into programmatically. You should have a document to upload (in my case, an MS Word document), and its accompanying URL from your own source document system. I'll use a trivial fictitious example.
NOTE: the documentation DOES NOT tell you to append , 'rb' to the end of the open statement, but it is required when uploading a Word document, as in my example below. Raw text / HTML documents can be uploaded without the 'rb' parameter.
url = {"source_url":"http://mysite/dis030.docx"}
with open(os.path.join(os.getcwd(), '{path to your document folder with trailing / }', 'dis030.docx'), 'rb') as fileinfo:
add_doc = discovery.add_document('{environment-id}', '{collections-id}', metadata=json.dumps(url), file=fileinfo).get_result()
print(json.dumps(add_doc, indent=2))
print(add_doc["document_id"])
Note the setting up of the metadata as a JSON dictionary, and then encoding it using json.dumps within the parameters. So far I've only wanted to store the original source URL but you could extend this with other parameters as your own use case requires.
This call to Discovery gives you the document ID.
You can now query the collection and extract the metadata using something like a Discovery query:
my_query = discovery.query('{environment-id}', '{collection-id}', natural_language_query="chlorine safety")
print(json.dumps(my_query.result["results"][0]["metadata"], indent=2))
Note: I'm extracting just the stored metadata from within the overall returned results. If you instead just print(my_query), you'll get the full response from Discovery ... but there's a lot to go through to identify just your own custom metadata.
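As a follow-up to the first answer's point about querying the field back out, here is a hedged sketch that filters on the stored metadata directly; it assumes the Discovery Query Language field::value filter syntax and reuses the environment/collection placeholders above:

# Hypothetical lookup of the document(s) uploaded with a given source URL
filtered = discovery.query('{environment-id}', '{collection-id}',
                           filter='metadata.source_url::"http://mysite/dis030.docx"').get_result()
for doc in filtered["results"]:
    print(doc["id"], doc["metadata"]["source_url"])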

How to reference custom Java class in Fusion Solr Javascript Index Stage?

In Fusion's JavaScript Index Stage, we can import Java classes and use them in the JavaScript, such as:
var imports = new JavaImporter(java.lang.String);
with (imports) {
    var name = new String("foo"); ...
}
If we have custom, complex Java classes, how do we include the compiled jar with Fusion so that the classes can be imported in a JavaScript Index Stage?
And where can we store configuration values for the JavaScript Index Stage to look up, and how do we retrieve them?
I'm thinking of something like this:
var imports = new JavaImporter(mycompany.com.custompkg.SomeParser);
with (imports) {
    var some_config = ResourceManager.GetString("key");
    var sp = new SomeParser(some_config); ...
}
Regards,
Kelvin
Starting in Fusion 4.x, the API and Connectors services started using a common location for jars, i.e. apps/libs. This is a reasonable place to put custom jars, but the services must be told about the new jars as well. That's done in two places:
/jetty/connectors-classic/webapps/connectors-extra-classpath.txt
./jetty/api/webapps/api-extra-classpath.txt
Also, index documents can get processed by the api service, so even if the jar is only used for indexing, register it with both classpaths. Finally, bounce the services.
Put the Java class file, as a jar file, in $FUSION_HOME/apps/jetty/api/webapps/api/WEB-INF/lib/.
I used this to access my custom class.
var SomeParser = Java.type('mycompany.com.custompkg.SomeParser');

GAE Python Testing

First, I'm quite new to GAE/Python, so please bear with me. Here's the situation:
I have a test_all.py which runs all the test suites in my package. The TestCase's setUp currently looks like this:
import test_settings as policy  # has consistency variables <- does not work
...

def setUp(self):
    # First, create an instance of the Testbed class.
    self.testbed = testbed.Testbed()
    # Then activate the testbed, which prepares the service stubs for use.
    self.testbed.activate()
    # Consistency policy for normal operations
    self.policy = None
    if policy.policy_flag == 'STRICT':
        # Create a consistency policy that will simulate the High Replication consistency model.
        self.policy = datastore_stub_util.PseudoRandomHRConsistencyPolicy(probability=0)
    # Initialize the datastore stub with this policy.
    self.testbed.init_datastore_v3_stub(consistency_policy=self.policy)
This is my primitive attempt to set up the datastore with different consistency policies and run tests against them. Total fail.
What I want is to run my test cases against different datastore consistencies in one go from my root test_all.py. Using the above or another approach, how can I do this? How do I pass parameters to the TestCase from the test runner?
Python, unit test - Pass command line arguments to setUp of unittest.TestCase
The thread above shows exactly how to pass runtime arguments to the test case; see its second answer specifically.
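A minimal sketch of that idea, adapted to the consistency-policy case: the runner sets a class attribute before building the suite, so the same tests can be run once per policy. DatastoreTestCase, consistency_probability and the placeholder test are hypothetical names; real tests would live in (or subclass) this TestCase:

import unittest

from google.appengine.datastore import datastore_stub_util
from google.appengine.ext import testbed


class DatastoreTestCase(unittest.TestCase):
    # Set by the runner before the suite is built; None means the default policy.
    consistency_probability = None

    def setUp(self):
        self.testbed = testbed.Testbed()
        self.testbed.activate()
        policy = None
        if self.consistency_probability is not None:
            # Simulate the High Replication consistency model.
            policy = datastore_stub_util.PseudoRandomHRConsistencyPolicy(
                probability=self.consistency_probability)
        self.testbed.init_datastore_v3_stub(consistency_policy=policy)

    def tearDown(self):
        self.testbed.deactivate()

    def test_placeholder(self):
        # Real datastore tests go here (or in subclasses).
        pass


if __name__ == '__main__':
    # test_all.py: run the same suite once per consistency setting.
    for probability in (None, 0):
        DatastoreTestCase.consistency_probability = probability
        suite = unittest.TestLoader().loadTestsFromTestCase(DatastoreTestCase)
        unittest.TextTestRunner(verbosity=2).run(suite)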
